OF A DISCRETE-TIME SINGLE-SERVER NETWORK
by

Armand  M.  Makowski1  and  Adam  Shwartz2 

ABSTRACT 

This  paper  considers  a  discrete-time  system  composed  of  K  infinite  capacity  queues  that  compete  for  the 
use of a single server. Customers arrive in i.i.d. batches and are served according to a server allocation policy. Upon
completing  service,  customers  either  leave  the  system  or  are  routed  instantaneously  to  another  queue  according 
to  some  random  mechanism.  As  an  alternative  to  simply  randomized  strategies,  a  policy  based  on  a  stochastic 
approximation  algorithm  is  proposed  to  drive  a  long-run  average  cost  to  a  given  value.  The  underlying  motivation 
can  be  traced  back  to  implementation  issues  associated  with  constrained  optimal  strategies. 

A  version  of  the  ODE  method  as  given  by  Metivier  and  Priouret  is  developed  for  proving  a.s.  convergence 
of  this  algorithm.  This  is  done  by  exploiting  the  recurrence  structure  of  the  system  under  non-idling  policies.  A 
probabilistic  representation  of  solutions  to  an  associated  Poisson  equation  is  found  most  useful  for  proving  their 
requisite  Lipschitz  continuity.  The  conditions  which  guarantee  convergence  are  given  directly  in  terms  of  the 
model  data.  The  approach  is  of  independent  interest,  as  it  is  not  limited  to  this  particular  queueing  application 
and  suggests  a  way  of  attacking  other  similar  problems. 
1. INTRODUCTION


1.1  Stochastic  approximations  on  Markov  chains 

In  recent  years,  there  has  been  widespread  interest  in  stochastic  approximation  algorithms  as 
a  means  to  solve  complex  engineering  problems  [5,15].  As  a  result  of  the  increasing  complexity  of 
applications,  focus  has  shifted  from  the  original  Robbins-Monro  algorithm  to  more  sophisticated 
versions,  and  this  has  led  in  particular  to  the  study  of  projected  stochastic  approximation  algorithms 
driven  by  Markovian  “noise”  or  “state”  processes. 

Such algorithms have the following form. Let $\{\eta(n),\ n = 0, 1, \ldots\}$ be the sequence of iterates
which take values in a compact convex subset $U$ of $\mathbb{R}^p$, and let $\{X(n),\ n = 0, 1, \ldots\}$ be the state
process which takes values in some Borel subset $S$ of a Euclidean space. They are related by the recursion

$$\eta(0) \in U, \quad \eta(n+1) = \Pi_U\{\eta(n) + a_{n+1} f(\eta(n), X(n+1))\}, \quad n = 0, 1, \ldots \tag{1.1}$$

where $\Pi_U$ denotes the nearest-point projection on $U$, $f$ is a Borel mapping $U \times S \to \mathbb{R}^p$ and
the step size sequence $\{a_{n+1},\ n = 0, 1, \ldots\}$ satisfies the conditions $0 < a_n \downarrow 0$, $\sum_n a_n = \infty$
and $\sum_n a_n^2 < \infty$. A complete specification of the algorithms (1.1) requires that the conditional
probability distribution of $X(n+1)$ given $(X(0), \eta(0), X(1), \ldots, X(n), \eta(n))$ be postulated for each
$n = 0, 1, \ldots$. For instance, the Markovian dependencies alluded to earlier require

$$P[X(n+1) \in B \mid X(0), \eta(0), X(1), \ldots, X(n), \eta(n)] = \mu_{\eta(n)}(X(n); B), \quad n = 0, 1, \ldots \tag{1.2}$$

for every Borel subset $B$ of $S$, where $\{\mu_\eta,\ \eta \in U\}$ is a family of one-step probability transition
kernels on $S$.
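The recursion (1.1) with the Markovian dependence (1.2) can be illustrated on a toy example. The following is only a sketch under stated assumptions: the interval $U = [0, 1]$, the two-point state space, the kernel `step_state`, the driving term `f` and the target value 0.3 are all hypothetical choices made for illustration and are not taken from the paper.

```python
import random

def project(x, lo=0.0, hi=1.0):
    """Nearest-point projection onto the interval U = [lo, hi]."""
    return min(max(x, lo), hi)

def step_state(x, eta, rng):
    """Hypothetical kernel mu_eta on S = {0, 1}: jump to 1 with probability eta."""
    return 1 if rng.random() < eta else 0

def f(eta, x, target=0.3):
    """Hypothetical driving term steering the frequency of state 1 to `target`."""
    return target - x

def run(n_steps=20000, eta0=0.9, seed=0):
    """Iterate eta(n+1) = Pi_U{eta(n) + a_{n+1} f(eta(n), X(n+1))}."""
    rng = random.Random(seed)
    eta, x = eta0, 0
    for n in range(n_steps):
        x = step_state(x, eta, rng)      # X(n+1) drawn from mu_{eta(n)}(X(n); .)
        a = 1.0 / (n + 1)                # step size a_{n+1}
        eta = project(eta + a * f(eta, x))
    return eta
```

In this toy model the averaged drift is `target - eta`, so the iterates should settle near 0.3; the example says nothing about the queueing model treated later.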

The central question in the theory of stochastic approximations is concerned with the convergence
properties of the iterate sequence $\{\eta(n),\ n = 0, 1, \ldots\}$. For the Robbins-Monro algorithm,
direct martingale arguments have been given by Gladyshev [10] to establish a.s. convergence. However,
in more complex situations such as (1.2), the direct probabilistic approach does not work,
and this failure prompted the development of the so-called ODE method. In most of its forms, the
ODE method proceeds in two separate steps. The first step relies on the Kushner-Clark Lemma in
order to identify a deterministic ODE, the stability properties of which determine the limit points
of $\{\eta(n),\ n = 0, 1, \ldots\}$. The second step is probabilistic in nature and depends on the algorithm
being considered; its purpose is to show that asymptotically (in the mode of convergence of interest)
the output sequence of the original algorithm behaves like the solution to the ODE.
In their monograph [16], Kushner and Clark have given general conditions for successfully
completing this second step. In more structured situations [15], Kushner has shown how weak
convergence methods, through various tightness properties, pave the way to convergence in
probability of the sequence $\{\eta(n),\ n = 0, 1, \ldots\}$. In the Markovian case, Metivier and Priouret [21] have
established a.s. convergence by making use of properties of the Poisson equation associated with
the transition kernels $\{\mu_\eta,\ \eta \in U\}$ appearing in (1.2). Key to their analysis are various properties
of Lipschitz continuity (in $\eta$) of the solution to this Poisson equation.

Unfortunately,  in  all  these  references,  the  conditions  underlying  the  second  step  of  the  ODE 
method  are  given  in  implicit  form  and  are  often  hard  to  verify  in  specific  situations.  What  seems 
desirable  is  a  more  operational  convergence  theory  where  conditions  are  given  directly  in  terms  of 
the  model  data.  For  instance,  this  was  done  by  the  authors  in  the  Markovian  situation  [17]  when 
the  state  space  S  is  finite,  in  which  case  (1.2)  reduces  to 

$$P[X(n+1) = y \mid X(0), \eta(0), X(1), \ldots, X(n), \eta(n)] = p_{X(n)y}(\eta(n)), \quad y \in S, \quad n = 0, 1, \ldots \tag{1.3}$$

for some family $\{P(\eta),\ \eta \in U\}$ of one-step transition probabilities with $P(\eta) = (p_{xy}(\eta))$. A
comprehensive convergence theory was developed under the mild condition of Lipschitz continuity
for the one-step transition probabilities $\eta \to p_{xy}(\eta)$. This was achieved by using a variant of the
approach  proposed  by  Metivier  and  Priouret,  and  leads  to  an  a.s.  convergence  result. 

When the state space S is countably infinite, the situation is much more difficult and no general
results seem available that guarantee a.s. convergence in terms of explicit conditions on the model
data. The main technical difficulty in the approach of Metivier and Priouret, used successfully in
the finite case [17], stems from the fact that several quantities of interest are no longer bounded
and  that  the  requisite  properties  of  the  solution  to  the  Poisson  equation  are  now  much  harder  to 
obtain.  This  paper  presents  arguments  for  establishing  both  these  smoothness  properties  and  the 
a.s.  convergence  of  the  algorithm.  The  general  framework  of  interest  is  described  in  Section  2,  and 
is  couched  in  the  formalism  of  the  theory  of  Markov  decision  processes;  this  is  done  for  notational 
convenience  as  will  become  apparent  in  later  sections.  The  approach  advocated  here  relies  on  the 
recurrence structure of the (controlled) system [19], and on a probabilistic interpretation of the
solution to the Poisson equation derived from it [27]. These arguments are developed in the context
of  an  adaptive  control  problem  for  a  specific  queueing  system,  namely  a  discrete-time  single-server 
network  with  random  routing,  which  is  described  in  Section  3.  The  approach  presented  here  is  of 
much  wider  applicability  and  should  be  of  use  in  analyzing  a  large  class  of  projected  stochastic 
approximations  driven  by  a  Markov  chain  on  a  countable  state  space.  The  main  advantage  of 
discussing a concrete application rather than the general situation lies in the fact that the key
arguments can then be provided in their simplest form, unencumbered by often confusing
technicalities, under verifiable conditions given solely in terms of the model data. However, to help the
reader apply the ideas proposed here to other situations, each one of the Sections 5-7 ends with an
outline of more general technical conditions which permit a development similar to the one given
here.

1.2  A  time-sharing  queueing  system 

The  queueing  system  considered  here  is  now  briefly  described;  a  precise  model  formulation  is 
available  in  Section  3:  Consider  a  system  composed  of  K  infinite  capacity  queues  that  compete 
for  the  use  of  a  single  server.  Time  is  slotted  with  the  service  requirement  of  each  customer 
corresponding  exactly  to  one  time  slot.  At  the  beginning  of  each  time  slot,  the  controller  gives 
priority  to  one  of  the  queues  according  to  some  prespecified  dynamic  priority  assignment,  and  the 
selected  queue  is  given  service  attention  during  that  slot.  However,  due  to  a  variety  of  reasons 
ranging  from  server  failure  to  exogenous  interferences,  with  a  positive  probability,  the  service  fails, 
in  which  case  the  service  of  that  customer  is  rescheduled  at  a  later  time  in  accordance  with  the 
service  allocation  policy.  When  in  a  given  time  slot  the  service  succeeds,  the  customer  is  either 
declared  serviced  and  leaves  the  system  at  the  end  of  the  slot,  or  is  routed  to  one  of  the  other 
queues  with  a  fixed  probability,  depending  on  both  source  and  destination  queues.  In  the  present 
paper,  the  failures  are  assumed  generated  through  independent  Bernoulli  processes,  with  possibly 
class-dependent  parameters,  and  this  independently  of  the  arrival  mechanism.  New  customers  may 
arrive in batches which are modelled as an arbitrary K-dimensional renewal process; this captures
possible  partial  correlations  between  arrivals  from  different  classes  in  a  given  slot. 

This  queueing  system  and  its  variants  constitute  useful  models  for  studying  resource  allocation 
issues  in  several  application  areas,  including  computer  systems  and  data  networks,  and  as  such  they 
have  received  a  great  deal  of  attention  in  recent  years.  Klimov  [13]  studied  a  continuous-time  version 
of  this  system,  and  proved  that  a  strict  priority  policy  minimizes  the  discounted  cost  associated 
with  a  cost-per-slot  linear  in  the  queue  sizes.  Tsoucas  and  Walrand  [29]  considered  an  adaptive 
version  of  Klimov’s  problem  where  the  service  distributions  are  unknown. 

The  case  where  no  routing  is  allowed  has  been  much  studied.  For  such  systems,  Sidi  and 
Segall  [28]  derived  the  joint  equilibrium  distribution  of  the  queue  size  under  a  fixed  priority  scheme. 
Several authors [3,4,8,11] have shown that the $\mu c$-rule minimizes a variety of performance measures
associated  with  the  aforementioned  linear  cost  structure.  In  [22],  Nain  and  Ross  considered  the 
situation where several types of traffic, say voice, video and data, compete for the use of a single
synchronous communication channel. They formulated this situation as a system of K discrete-time
queues and found the service allocation strategy minimizing the long-run average of a linear
expression in the queue sizes of K - 1 customer classes, under the constraint that the long-run
average queue size of the remaining customer class does not exceed a certain value. Extending
some of the optimality results from Baras, Ma and Makowski [4], they showed that if the constraint
can be met, then the optimal policy $g$ is a Markov stationary policy with the following structure:
There exist two static work-conserving service assignment policies (of which $\mu c$-rules are only one
description), say $\overline{g}$ and $\underline{g}$, and a scalar $\eta^*$ in (0, 1). At the beginning of each time slot, a coin with
bias $\eta^*$ is flipped, and the policy $g$ implements channel rights according to the outcome via $\overline{g}$ and $\underline{g}$
with probability $\eta^*$ and $1 - \eta^*$, respectively. The bias $\eta^*$ is determined so as to meet the constraint.
This result was extended by Altman and Shwartz to the case where the constraint is also given
through a linear combination [1,2].
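The coin-flip randomization between two static policies described above can be sketched in a few lines. This is a minimal illustration only: the representation of a static work-conserving policy as a priority ordering of the queues, and the names `highest_priority_nonempty` and `mixed_policy`, are assumptions made here and do not come from the paper.

```python
import random

def highest_priority_nonempty(x, order):
    """Static work-conserving rule: serve the first non-empty queue in `order`."""
    for k in order:
        if x[k] > 0:
            return k
    return order[0]  # all queues empty: the choice is immaterial

def mixed_policy(x, eta_star, order_upper, order_lower, rng):
    """Flip a coin with bias eta_star, then follow one of the two static rules."""
    if rng.random() < eta_star:
        return highest_priority_nonempty(x, order_upper)
    return highest_priority_nonempty(x, order_lower)
```

A call such as `mixed_policy([2, 3], 0.25, [0, 1], [1, 0], random.Random(1))` returns the index of the queue to be served in the current slot; over many slots the first ordering is used in roughly a fraction `eta_star` of them.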


These results are typical in the broader context of MDPs in that analysis often identifies a
policy $g$ of interest which is Markov stationary. In fact, for the problem of minimizing one average
cost subject to a constraint on another such cost, an optimal policy which "mixes" two deterministic
policies in the manner described above exists under very general conditions as demonstrated by
several authors [6, 7, 24]. Unfortunately, this policy may not be readily implementable due either
to a lack of knowledge of the actual values of some parameters [14] or to computational difficulties
inherent to its definition. The situation treated by Nain and Ross is a good case in point, for there
non-trivial off-line computations are required in order to actually compute the value of the bias $\eta^*$,
even if all parameters are known.


1.3  Overview  of  the  paper 


This implementation issue provides the motivation for the stochastic approximation studied in
this paper. In Section 4, the issue is discussed in the broader context of "steering the cost to a
given value", with a view towards applications to constrained optimization [1,2,22]. The problem
is now one of finding the bias $\eta^*$ needed in a simple randomization between two policies $\underline{g}$ and
$\overline{g}$ in order to steer a long-run average cost to a given value. The resulting randomized Markov
stationary policy, denoted $g$ hereafter, can be implemented by means of a projected stochastic
approximation. This algorithm computes on-line estimates of $\eta^*$ which are then used in a Certainty
Equivalence controller $\alpha$ derived from the special form of $g$. Theorems 4.1 and 4.2 contain the main
results concerning the performance of this policy $\alpha$, namely that the policies $\alpha$ and $g$ yield the same
value for the long-run average cost, and that the iterates $\{\eta(n),\ n = 0, 1, \ldots\}$ converge a.s. under
$\alpha$ to the bias value $\eta^*$. This improves on earlier results of the authors [25] for the same algorithm
in the context of the two-queue system with no routing. There, only convergence in probability
was established, albeit under weaker conditions on moments.

The  convergence  proof  for  the  stochastic  approximation  algorithm  hinges  on  the  availability  of 
bounds  on  moments  of  the  queue  size  process  which  are  uniform  in  the  policy,  and  on  the  smoothness 
properties  of  solutions  to  an  associated  Poisson  equation  [21,  27].  The  bounds  are  obtained  in 
Section  5  by  means  of  renewal  arguments  which  relate  the  queue  size  to  the  recurrence  times  to 
the  empty  state.  In  Section  6,  novel  arguments  are  developed  for  proving  the  Lipschitz  continuity 
for  solutions  to  the  Poisson  equation  and  for  establishing  bounds  on  them.  It  is  appropriate  to 
stress  the  methodological  value  of  both  Sections  5  and  6,  in  that  ideas  therein  are  by  no  means 
restricted  to  the  competing  queue  model  or  to  the  randomization  of  two  policies,  and  can  be 
used  mutatis  mutandis  in  many  other  situations.  However,  the  approach  was  developed  here  for 
a  stochastic  approximation  algorithm  for  a  specific  model,  rather  than  for  general  Markov  chains 
with countable state spaces, in order to present the arguments more clearly, unencumbered by
technical details and assumptions which often accompany more formal treatments.

The a.s. convergence of the stochastic approximation scheme defining the implementation $\alpha$
is established in Section 7, where the various estimates of the previous sections allow for a rather
simple proof. Finally, the cost properties of the policy $\alpha$ are discussed in Section 8 by making use
of the convergence of the stochastic approximation and by invoking the results on the "Certainty
Equivalence" Principle developed in [26]; the requisite hypotheses of [26] are easily verified for this
system with the help of bounds on solutions to the Poisson equation. The paper concludes with
an application to the constrained optimization problem discussed by Nain and Ross in [22]. All
necessary conditions are verified and the policy $\alpha$ thus constitutes an implementation of the Markov
stationary policy which is constrained optimal for this problem.

2.  A  GENERAL  MODEL 

This section introduces a general class of projected stochastic approximations driven by Markovian
noise. The formalism of the theory of Markov decision processes [REF] was found notationally
convenient  in  defining  the  class  of  stochastic  approximation  schemes  of  interest.  Indeed,  this  more 
general  framework  lends  itself  naturally  to  the  presentation  of  more  general  conditions  as  done  at 
the  end  of  Sections  5-7. 
A few words on the notation and conventions used throughout the paper. The set of all non-negative
integers is denoted by $\mathbb{N}$, and $\mathbb{R}$ (resp. $\mathbb{R}_+$) stands for the set of all real (resp. positive
real) numbers. The indicator function of a set $A$ is denoted by $I[A]$. Unless stated otherwise, the
notations $\underline{\lim}_n$ and $\overline{\lim}_n$ are understood with $n$ going to infinity.

2.1  The  MDP  formulation 

To set up the discussion, first consider an MDP $(S, U, P)$ as defined in the literature [REF]
where the state space $S$ is a countable set (for sake of concreteness it will be convenient to take
$S = \mathbb{N}^K$) and the action space $U$ is a compact convex subset of $\mathbb{R}^p$. The one-step transition
mechanism $P$ is defined through the one-step transition probability functions $U \to \mathbb{R} : u \to p_{xy}(u)$,
$x, y$ in $S$, which are assumed to be Borel measurable and to satisfy the standard properties

$$0 \le p_{xy}(u) \le 1, \quad \sum_{y \in S} p_{xy}(u) = 1, \quad u \in U,\ x, y \in S. \tag{2.1}$$

The space of probability measures on $U$ (when equipped with its natural Borel $\sigma$-field) is
denoted by $\mathbb{M}(U)$. An admissible control policy $\pi$ is defined as any collection $\{\pi_n,\ n = 0, 1, \ldots\}$ of
mappings $\pi_n : S \times (U \times S)^n \to \mathbb{M}(U)$ such that for all $n = 0, 1, \ldots$ and every Borel subset $B$ of
$U$, the mapping $S \times (U \times S)^n \to [0, 1] : h_n \to \pi_n(B; h_n)$ is Borel measurable. The collection of all
such admissible policies is denoted by $\mathcal{P}$.

The definition of the MDP $(S, U, P)$ postulates the existence of a measurable space $(\Omega, \mathcal{F})$ and
of a collection of probability measures $\{P^\pi,\ \pi \in \mathcal{P}\}$ such that the conditions (2.3)-(2.5) below are
satisfied. The measurable space $(\Omega, \mathcal{F})$ is chosen large enough to carry the sequences of $S$-valued
rvs $\{X(n),\ n = 0, 1, \ldots\}$ and $U$-valued rvs $\{U(n),\ n = 0, 1, \ldots\}$, with the interpretation that $X(n)$
denotes the state of the system at time $n$ and $U(n)$ represents the action taken in that state. The
feedback information is encoded through the rvs $\{H(n),\ n = 0, 1, \ldots\}$ defined by $H(0) := X(0)$
and

$$H(n) := (X(0), U(0), X(1), \ldots, U(n-1), X(n)), \quad n = 1, 2, \ldots \tag{2.2}$$

The rv $H(n)$ takes values in $\mathbb{H}_n := S \times (U \times S)^n$, and we set $\mathcal{F}_n := \sigma\{H(n)\}$.

To complete the definition, let $p$ be a fixed probability distribution on $S$. For every admissible
policy $\pi$ in $\mathcal{P}$, the probability measure $P^\pi$ is constructed on $(\Omega, \mathcal{F})$ such that under $P^\pi$, the rv $X(0)$
has distribution $p$, the state transitions are realized according to

$$P^\pi[X(n+1) = y \mid \mathcal{F}_n \vee \sigma\{U(n)\}] = p_{X(n)y}(U(n)), \quad y \in S, \quad n = 0, 1, \ldots \tag{2.3}$$
and the control actions are selected according to

$$P^\pi[U(n) \in B \mid \mathcal{F}_n] = \pi_n(B; H(n)), \quad n = 0, 1, \ldots \tag{2.4}$$

for every Borel subset $B$ of $U$. Consequently,

$$P^\pi[X(n+1) = y \mid \mathcal{F}_n] = \int_U p_{X(n)y}(u)\, \pi_n(du; H(n)), \quad y \in S, \quad n = 0, 1, \ldots \tag{2.5}$$

The expectation operator associated with $\pi$ is denoted by $E^\pi$.

The measurable space $(\Omega, \mathcal{F})$ is often selected to be the so-called canonical space, i.e., $\Omega$ is the
cartesian product $\Omega := S \times (U \times S)^\infty$ endowed with the natural Borel structure inherited from
the product topology. However, in many concrete situations, it is more convenient to describe the
underlying MDP on a measurable space $(\Omega, \mathcal{F})$ which is somewhat larger than the canonical space.
For example, in the setup considered in this paper, additional rvs are needed to encode arrivals,
service completions and random routing in the queueing system. In this case the definitions of $\mathbb{H}_n$,
$H(n)$ and $\mathcal{F}_n$ are changed accordingly in the obvious way.

Following standard usage, a policy $\pi$ in $\mathcal{P}$ is said to be a Markov or memoryless policy if there
exists a family $\{g_n,\ n = 0, 1, \ldots\}$ of Borel mappings $g_n : S \to \mathbb{M}(U)$ such that $\pi_n(\cdot; H(n)) =
g_n(\cdot; X(n))$ $P^\pi$-a.s. for all $n = 0, 1, \ldots$. In the event the mappings $\{g_n,\ n = 0, 1, \ldots\}$ are all
identical to a given mapping $g : S \to \mathbb{M}(U)$, the Markov policy is termed stationary and is identified
with the mapping $g$ itself. Under any Markov stationary $g$, the state process $\{X(n),\ n = 0, 1, \ldots\}$
evolves according to a Markov chain with one-step transition probability matrix $P(g) = (p_{xy}(g))$
given by

$$p_{xy}(g) := \int_U p_{xy}(u)\, g(du; x), \quad x, y \in S. \tag{2.6}$$

Finally, a policy $\pi$ in $\mathcal{P}$ is said to be a deterministic or non-randomized policy if there exists a sequence
of mappings $\{f_n,\ n = 0, 1, \ldots\}$ such that for each $n = 0, 1, \ldots$, the mapping $f_n : \mathbb{H}_n \to U$ is
Borel measurable and the probability measure $\pi_n(\cdot; H(n))$ is a point mass distribution concentrated
at $f_n(H(n))$ $P^\pi$-a.s.
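When a stationary policy randomizes over finitely many actions, the integral in (2.6) reduces to a finite mixture of transition matrices. The sketch below illustrates this special case only; the finite state space, the finite action set, the nested-list layout and the name `mix_kernel` are assumptions introduced here for illustration.

```python
def mix_kernel(p, g):
    """Transition matrix of the chain under a randomized stationary policy.

    p[u][x][y]: one-step transition probability from x to y under action u.
    g[x][u]:    probability that the policy selects action u in state x.
    Returns P(g) with P(g)[x][y] = sum over u of p[u][x][y] * g[x][u].
    """
    n_actions = len(p)
    n_states = len(g)
    return [
        [sum(p[u][x][y] * g[x][u] for u in range(n_actions))
         for y in range(n_states)]
        for x in range(n_states)
    ]
```

Each row of the result is again a probability vector, since it is a convex combination of rows of the matrices `p[u]`.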

2.2  The  stochastic  approximation 

Stochastic approximations on Markov chains, as defined by (1.1) and (1.3), can be interpreted
as deterministic policies for the MDP $(S, U, P)$ described earlier. To see this, start with a mapping
$c : S \to \mathbb{R}^p$ and let $\{\eta(n),\ n = 0, 1, \ldots\}$ be the sequence of $U$-valued rvs determined by the
recursion

$$\eta(0) \in U, \quad \eta(n+1) = \Pi_U\{\eta(n) + a_{n+1} c(X(n+1))\}, \quad n = 0, 1, \ldots \tag{2.7}$$

As before, $\Pi_U$ denotes the nearest-point projection on $U$, and the step size sequence $\{a_{n+1},\ n =
0, 1, \ldots\}$ satisfies the usual conditions

$$0 < a_n \downarrow 0, \quad \sum_{n=0}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} a_n^2 < \infty. \tag{2.8}$$

The policy associated with the recursion (2.7) is the deterministic policy $\alpha = \{\alpha_n,\ n = 0, 1, \ldots\}$
with the property that for all $n = 0, 1, \ldots$, $\alpha_n(\cdot; H(n))$ is the point mass distribution concentrated
at $\eta(n)$. That this policy is indeed admissible follows from the fact that for each $n = 0, 1, \ldots$, the
rv $\eta(n)$ can be expressed as a function of the successive states $X(0), X(1), \ldots, X(n)$.
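As a quick check of the step size conditions (2.8), the harmonic choice below is one standard example; it is offered as an illustration and is not a choice made in the paper.

```latex
a_n = \frac{1}{n+1}, \quad n = 0, 1, \ldots :
\qquad
\sum_{n=0}^{\infty} \frac{1}{n+1} = \infty ,
\qquad
\sum_{n=0}^{\infty} \frac{1}{(n+1)^2} = \frac{\pi^2}{6} < \infty .
```

By contrast, $a_n = (n+1)^{-1/2}$ violates the square-summability requirement, while $a_n = (n+1)^{-2}$ is summable and therefore violates the divergence requirement.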

3.  THE  DISCRETE-TIME  KLIMOV  MODEL 

This section presents in some detail the model for the controlled queueing system briefly
described in the introduction. First, a few words on the notation and convention in use. Elements
of $\mathbb{R}^K$ are always interpreted as $K \times 1$ column vectors, and the $k$th component of any element
$x$ of $\mathbb{R}^K$ is denoted by $x_k$, $1 \le k \le K$, with a similar convention for rvs. Thus an element $x$
of $\mathbb{R}^K$ can also be written as $(x_1, \ldots, x_K)'$ (with $'$ denoting transpose), and its norm is denoted by
$|x|$. The elements $e$ and $0$ of $\mathbb{R}^K$ are defined as the vectors $e = (1, \ldots, 1)'$ and
$0 = (0, \ldots, 0)'$ with identical components. The standard basis $\{e^1, \ldots, e^K\}$ for $\mathbb{R}^K$ is denoted by $B_K$,
while $S_K$ is the standard simplex defined by

$$S_K := \Big\{p \in \mathbb{R}^K : \sum_{k=1}^{K} p_k = 1 \ \text{and} \ 0 \le p_k \le 1,\ 1 \le k \le K\Big\}. \tag{3.1}$$

It is plain that $S_K$ can be identified with $\mathbb{M}(U)$ when $U = B_K$.

3.1  The  basic  random  variables 

The controlled queueing system of interest, the so-called discrete-time Klimov model, will be
defined as an MDP. All probabilistic elements are defined on a single sample space $\Omega$ equipped
with the $\sigma$-field of events $\mathcal{F}$. This sample space carries the basic rvs $\Xi$, $\{U(n),\ n = 0, 1, \ldots\}$,
$\{A(n),\ n = 0, 1, \ldots\}$, $\{B(n),\ n = 0, 1, \ldots\}$ and $\{R(n),\ n = 0, 1, \ldots\}$ which take values in $\mathbb{N}^K$, $B_K$,
$\mathbb{N}^K$, $\{0, 1\}^K$ and $\{0, 1, \ldots, K\}^K$, respectively. These quantities have a ready interpretation in the
context of the queueing system described in the introduction: The number of customers initially
in the $k$th queue is set at $\Xi_k$ and for each $n = 0, 1, \ldots$, the state of the system is represented by a
rv $X(n)$ of integer components with the interpretation that at the beginning of the slot $[n, n+1)$,
$X_k(n)$ customers are present in the $k$th queue, including the one receiving service. The following
chain of events occurs:

(i): The control action $U(n)$ is selected with the convention that $U_k(n) = 1$ (resp. $U_k(n) = 0$)
if the $k$th queue is (resp. is not) given service attention during that slot. The fact that
$U(n)$ takes values in $B_K$ guarantees that exactly one queue is given service attention;

(ii): New customers arrive into the system according to the rv $A(n)$ with $A_k(n)$ new customers
joining the $k$th queue;

(iii): A completion of service possibly occurs at the queue that was given service attention
during the slot. This is encoded in the binary rv $B(n)$, where $B_k(n) = 1$ (resp. $B_k(n) = 0$)
signifies successful completion (resp. abortion) of service for the $k$th queue conditioned on
it being given service attention and non-empty; and

(iv): If a service completion occurs at the queue that was given service attention during the
slot, then instantaneously the serviced customer is either transferred to another queue or
it leaves the network. This routing decision is implemented through the variable $R(n)$
with the following interpretation. If the service completion occurred at the $k$th queue,
then $R_k(n) = \ell$, $1 \le \ell \le K$, means that the serviced customer joins the $\ell$th queue while
$R_k(n) = 0$ expresses the fact that this customer leaves the system.

As a result of (i)-(iv), the successive system states or queue size vectors form a sequence
$\{X(n),\ n = 0, 1, \ldots\}$ of $\mathbb{N}^K$-valued rvs which are generated componentwise through the recursion

$$X_k(n+1) = X_k(n) + A_k(n) - I[X_k(n) \ne 0]\, U_k(n) B_k(n) + \sum_{\ell=1}^{K} I[X_\ell(n) \ne 0]\, U_\ell(n) B_\ell(n)\, I[R_\ell(n) = k], \quad 1 \le k \le K, \quad n = 0, 1, \ldots \tag{3.2}$$

with $X(0) := \Xi$.
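The recursion (3.2) can be read as a one-slot update of the queue size vector. The sketch below is a direct transcription under a few conventions that are assumptions of this illustration: queues are stored in 0-based Python lists, routing targets keep the 1-based coding of the text (0 means the customer leaves), and the function name `next_state` is hypothetical.

```python
def next_state(x, u, a, b, r):
    """One slot of recursion (3.2).

    x: queue sizes X(n); u: one-hot service choice U(n); a: arrivals A(n);
    b: service-success bits B(n); r: routing targets R(n), where 0 means
    "leave the system" and l >= 1 means "join the l-th queue".
    """
    K = len(x)
    new_x = [x[k] + a[k] for k in range(K)]
    for l in range(K):
        # A departure needs a non-empty queue, service attention and success.
        if x[l] != 0 and u[l] == 1 and b[l] == 1:
            new_x[l] -= 1
            if r[l] != 0:
                new_x[r[l] - 1] += 1   # instantaneous routing to queue r[l]
    return new_x
```

For instance, `next_state([2, 0], [1, 0], [0, 1], [1, 1], [2, 0])` serves the first queue successfully and routes the customer to the second queue, which also receives one arrival.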

At the beginning of each time slot $[n, n+1)$, the decision-maker has knowledge of the rv $H(n)$
which here includes the initial queue sizes $\Xi$, the past arrival pattern $A(i)$, $0 \le i < n$, past decisions
$U(i)$, $0 \le i < n$, past service completions $B(i)$, $0 \le i < n$, and past routing decisions $R(i)$, $0 \le i < n$.
The rvs $\{H(n),\ n = 0, 1, \ldots\}$ are thus given recursively by

$$H(0) = \Xi, \quad H(n+1) := (H(n), U(n), A(n), B(n), R(n)), \quad n = 0, 1, \ldots \tag{3.3}$$

The information contained in $H(n)$ is used to generate the control value $U(n)$ implemented in the
slot $[n, n+1)$. The selection of this control value is done according to a prespecified mechanism,
which may be either deterministic or random.

3.2  The  probabilistic  structure 

Since randomized strategies are allowed, an admissible control policy $\pi$ is defined as any
collection $\{\pi_n,\ n = 0, 1, \ldots\}$ of mappings $\pi_n : \mathbb{H}_n \to S_K$, with the interpretation that at times
$n = 0, 1, \ldots$, the $k$th queue is given service attention with probability $\pi_n(k; h_n)$ whenever the
information vector $h_n$ in $\mathbb{H}_n$ is available to the system controller. Denote the collection of all such
admissible policies by $\mathcal{P}$.

Let $q_\Xi(\cdot)$ and $q(\cdot)$ be two probability mass distributions on $\mathbb{N}^K$, and fix a service rate vector
$\mu$ in $(0, 1]^K$. Moreover, let $P$ denote a $K \times K$ substochastic matrix $(p_{k\ell},\ 1 \le k, \ell \le K)$, i.e.,

$$0 \le p_{k\ell} \le 1 \quad \text{and} \quad \sum_{\ell=1}^{K} p_{k\ell} \le 1, \quad 1 \le k, \ell \le K \tag{3.4}$$

and set

$$p_{k0} := 1 - \sum_{\ell=1}^{K} p_{k\ell}, \quad 1 \le k \le K. \tag{3.5}$$

Throughout the discussion, the non-degeneracy condition

$$0 < q(0) < 1 \tag{3.6a}$$

and the finite mean condition

$$\sum_{a \in \mathbb{N}^K} |a|\, q(a) < \infty \tag{3.6b}$$

are enforced. It is always assumed that the matrix $I - P$ is invertible, a condition which is equivalent
to the system being open, i.e., every customer eventually leaves the system with probability one.

The model is now completely specified by postulating the existence of a family $\{P^\pi,\ \pi \in \mathcal{P}\}$
of probability measures on the $\sigma$-field $\mathcal{F}$ which satisfies the requirements (R1)-(R3) below, i.e.,
for every policy $\pi$ in $\mathcal{P}$,

(R1): For all $x$ in $\mathbb{N}^K$,

$$P^\pi[\Xi = x] := q_\Xi(x);$$

(R2): For all $a$ in $\mathbb{N}^K$, $b$ in $\{0, 1\}^K$ and $r$ in $\{0, 1, \ldots, K\}^K$,

$$P^\pi[A(n) = a,\ B(n) = b,\ R(n) = r \mid \mathcal{F}_n \vee \sigma\{U(n)\}] := P^\pi[A(n) = a]\, P^\pi[B(n) = b]\, P^\pi[R(n) = r]$$
$$= q(a) \cdot \prod_{k=1}^{K} \big(b_k \mu_k + (1 - b_k)(1 - \mu_k)\big) \cdot \prod_{k=1}^{K} p_{k r_k}, \quad n = 0, 1, \ldots$$

where $\mathcal{F}_n = \sigma\{H(n)\}$ with $H(n)$ defined by (3.3).

(R3): For all $e^k$ in $B_K$, $1 \le k \le K$,

$$P^\pi[U(n) = e^k \mid \mathcal{F}_n] := \pi_n(k; H(n)), \quad n = 0, 1, \ldots$$

The existence of a sample space $(\Omega, \mathcal{F})$ that carries such a family of probability measures
$\{P^\pi,\ \pi \in \mathcal{P}\}$ is easily established via the Kolmogorov Extension Theorem, by taking $\Omega$ to be the
canonical space $\mathbb{N}^K \times (B_K \times \mathbb{N}^K \times \{0, 1\}^K \times \{0, 1, \ldots, K\}^K)^\infty$ equipped with its natural $\sigma$-field.
This modeling approach was adopted in [25] for a special case of the Markov decision process under
consideration; the reader is referred there for additional information.

The reader will readily check that for each policy $\pi$ in $\mathcal{P}$, the following properties (P1)-(P4) hold true under $P^\pi$, where

(P1): The $\mathbb{N}^K$-valued rv $\Xi$ and the sequences of rvs $\{A(n),\ n = 0, 1, \ldots\}$, $\{B(n),\ n = 0, 1, \ldots\}$ and $\{R(n),\ n = 0, 1, \ldots\}$ are mutually independent;

(P2): The sequences $\{B_k(n),\ n = 0, 1, \ldots\}$ of $\{0, 1\}$-valued rvs are mutually independent i.i.d. Bernoulli sequences with parameters $\mu_k$, $1 \le k \le K$;

(P3): The sequences $\{R_k(n),\ n = 0, 1, \ldots\}$ of $\{0, 1, \ldots, K\}$-valued rvs are mutually independent i.i.d. sequences with

$$P^\pi[R_k(n) = \ell] = p_{k\ell}, \qquad 1 \le k, \ell \le K;\quad n = 0, 1, \ldots \tag{3.7}$$

(P4): The $\mathbb{N}^K$-valued rvs $\{A(n),\ n = 0, 1, \ldots\}$ form a sequence of i.i.d. rvs with common probability distribution $q(\cdot)$.

For $1 \le k \le K$, denote by $\lambda_k$ the first moment of the sequence $\{A_k(n),\ n = 0, 1, \ldots\}$ and set $\nu_k = \mu_k^{-1}$. For future use, define the network traffic coefficient $\rho$ by

$$\rho := \lambda'(I - P)^{-1}\nu \tag{3.8}$$

where $\lambda = (\lambda_1, \ldots, \lambda_K)'$ and $\nu = (\nu_1, \ldots, \nu_K)'$.
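As an illustration of (3.8), the following minimal sketch computes the traffic coefficient $\rho = \lambda'(I - P)^{-1}\nu$ with exact rational arithmetic. The helper names `solve_linear` and `traffic_coefficient`, as well as the two-queue tandem data, are assumptions made for this example only and do not come from the paper.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a w = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    # Augmented matrix [a | b] with rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Look for a non-zero pivot; none exists when I - P is singular.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix: the network is not open")
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def traffic_coefficient(lam, P, mu):
    """Return rho = lam' (I - P)^{-1} nu, where nu_k = 1 / mu_k."""
    K = len(lam)
    nu = [1 / Fraction(m) for m in mu]
    i_minus_p = [[(1 if k == l else 0) - Fraction(P[k][l]) for l in range(K)]
                 for k in range(K)]
    w = solve_linear(i_minus_p, nu)  # w = (I - P)^{-1} nu
    return sum(Fraction(lam[k]) * w[k] for k in range(K))


# Hypothetical tandem: queue 1 feeds queue 2, and queue 2 feeds the exit.
lam = [Fraction(1, 5), Fraction(0)]
P = [[0, 1], [0, 0]]
mu = [Fraction(1, 2), Fraction(1, 2)]
rho = traffic_coefficient(lam, P, mu)
print(rho)
```

In this toy tandem each arriving customer needs on average two slots of service attention at each of the two queues, so the stability condition $\rho < 1$ used in later sections can be checked directly from the model data.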

A policy $\pi$ in $\mathcal{P}$ is said to be non-idling or work-conserving whenever for all $1 \le k \le K$, the condition

$$[\pi_n(k; H(n)) > 0,\ X(n) \ne 0] = [\pi_n(k; H(n)) > 0,\ X_k(n) \ne 0], \qquad n = 0, 1, \ldots \tag{3.9}$$

holds true $P^\pi$-a.s.

4.  PROBLEM  FORMULATION 

Let $c$ denote a mapping $\mathbb{N}^K \to \mathbb{R}$. For any admissible policy $\pi$ in $\mathcal{P}$, set

$$J(\pi) := \lim_n E^\pi\left[\frac{1}{n+1} \sum_{t=0}^{n} c(X(t))\right] \tag{4.1}$$

(whenever meaningful) with the usual interpretation that $J(\pi)$ is a measure of system performance when using the policy $\pi$.

4.1 Steering the cost

Constrained MDPs lead to optimal stationary policies which randomize between several stationary deterministic policies [6, 7, 23, 24]. Given the constituent deterministic policies, the problem of finding the optimal policy reduces to simultaneously steering constraint functionals of the form (4.1) to given values. For simplicity, only the scalar case (arising from a single constraint) is discussed here, in which case the steering problem consists in finding a Markov stationary policy $g$ such that $J(g) = V$ for some given constant $V$. The discussion is given under the assumption that there exist two Markov (possibly randomized) stationary policies $\underline{g}$ and $\overline{g}$ such that

$$J(\underline{g}) < V < J(\overline{g}). \tag{4.2}$$

For every $\eta$ in the unit interval $[0, 1]$ the policy $f^\eta$ is the Markov stationary policy obtained by simply randomizing between the two policies $\underline{g}$ and $\overline{g}$ with bias $\eta$; it is determined through the mapping $f^\eta : \mathbb{N}^K \to S_K$ where

$$f^\eta(k; x) := \eta\, \overline{g}(k; x) + (1 - \eta)\, \underline{g}(k; x), \qquad x \in \mathbb{N}^K,\ 1 \le k \le K. \tag{4.3}$$

Note that for $\eta = 1$ (resp. $\eta = 0$) the randomized policy $f^\eta$ coincides with $\overline{g}$ (resp. $\underline{g}$). Owing to (4.2), if the mapping $\eta \to J(f^\eta)$ is continuous on the interval $[0, 1]$, then at least one randomized strategy $f^{\eta^*}$ meets the value $V$ and its corresponding bias value $\eta^*$ is a solution of the equation

$$J(f^\eta) = V, \qquad 0 \le \eta \le 1, \tag{4.4}$$

so that the identification $g = f^{\eta^*}$ may take place.
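The simple randomization (4.3) can be sketched in a few lines. In the snippet below, `mix_policies` and the two priority rules `serve_queue_1_first` and `serve_queue_2_first` are hypothetical stand-ins for $\overline{g}$ and $\underline{g}$ in a two-queue system; they are assumptions for illustration, not policies taken from the paper.

```python
def mix_policies(eta, g_upper, g_lower):
    """Return x -> f^eta(.; x) = eta * g_upper(.; x) + (1 - eta) * g_lower(.; x)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("the bias eta must lie in [0, 1]")

    def f_eta(x):
        upper, lower = g_upper(x), g_lower(x)
        return [eta * u + (1.0 - eta) * l for u, l in zip(upper, lower)]

    return f_eta


def serve_queue_1_first(x):
    """Non-idling priority rule: serve queue 1 unless it is empty."""
    return [1.0, 0.0] if x[0] > 0 else [0.0, 1.0]


def serve_queue_2_first(x):
    """Non-idling priority rule: serve queue 2 unless it is empty."""
    return [0.0, 1.0] if x[1] > 0 else [1.0, 0.0]


f = mix_policies(0.25, serve_queue_1_first, serve_queue_2_first)
both_busy = f((3, 2))      # both queues non-empty: the bias decides
second_empty = f((3, 0))   # only queue 1 holds customers
print(both_busy, second_empty)
```

When only one queue is non-empty, both constituent rules agree, so the mixture serves that queue with probability one whatever the bias; this is the sense in which a mixture of non-idling policies is itself non-idling.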

4.2  Implementation  issues 

Solving the (highly) nonlinear equation (4.4) for the bias value $\eta^*$ is usually a non-trivial task, even in the simplest of situations [18, 22]. This difficulty is circumvented by proposing alternatives to the policy $g$ which bypass a direct solution of the equation (4.4). One possible approach is to design (simple recursive) schemes for estimating the value $\eta^*$ which solves (4.4) and then to define a so-called “naive feedback” policy $\alpha = \{\alpha_n,\ n = 0, 1, \ldots\}$ via the Certainty Equivalence Principle. Such a policy $\alpha$ can be written in the form

$$\alpha_n := \eta(n)\, \overline{g} + (1 - \eta(n))\, \underline{g}, \qquad n = 0, 1, \ldots \tag{4.5}$$

for some sequence of $[0, 1]$-valued rvs $\{\eta(n),\ n = 0, 1, \ldots\}$ which act as “estimates” for the bias value $\eta^*$. It is hoped that the effects of controlling and learning about the system will combine to produce a consistent estimation scheme. In such a case, the sequence of estimates $\{\eta(n),\ n = 0, 1, \ldots\}$ converges to the value $\eta^*$ in some sense, thus providing increasingly better approximations to the appropriate bias value. This policy $\alpha$ will constitute an acceptable implementation of $g$ provided $J(\alpha) = J(g)$.

At this point, the reader may wonder as to how such an estimation scheme can be selected. If the function $\eta \to J(f^\eta)$ were continuous and strictly monotone (necessarily increasing by (4.2)-(4.3)), then the search for $\eta^*$ could be interpreted as finding the zero of the continuous, strictly monotone function $\eta \to J(f^\eta) - V$, and this brings to mind ideas from the theory of stochastic approximations [16]. Here, the Robbins-Monro version of these algorithms suggests that a sequence of bias values $\{\eta(n),\ n = 0, 1, \ldots\}$ be generated through the recursion

$$\eta(n+1) = \Bigl[\eta(n) + a_{n+1}\bigl(V - c(X(n+1))\bigr)\Bigr]_0^1, \qquad n = 0, 1, \ldots \tag{4.6}$$

In (4.6) the notation $[x]_0^1 = 0 \vee (x \wedge 1)$ is used for every $x$ in $\mathbb{R}$, and the sequence of step sizes $\{a_n,\ n = 1, 2, \ldots\}$ satisfies the conditions (2.8).

Lemma 8.2 gives a set of conditions under which the monotonicity property holds.

4.3  The  results 

This paper is devoted to analyzing the performance of the adaptive policy $\alpha$ defined through (4.5)-(4.6). The main results in this direction are described below and require the additional assumptions (R4)-(R6) on the data of the problem, where

(R4): There exists some integer $\gamma \ge 1$ such that for every policy $\pi$ in $\mathcal{P}$, the moment conditions

$$E^\pi[|\Xi|^\gamma] = \sum_{x \in \mathbb{N}^K} |x|^\gamma q_\Xi(x) < \infty$$

and

$$E^\pi[|A(n)|^\gamma] = \sum_{a \in \mathbb{N}^K} |a|^\gamma q(a) < \infty, \qquad n = 0, 1, \ldots$$

hold true;

(R5): There exist an integer $\delta \ge 0$ and a constant $L > 0$ in $\mathbb{R}$ such that

$$|c(x)| \le L(1 + |x|^\delta) =: c(|x|), \qquad x \in \mathbb{N}^K;$$

and

(R6): The policies $\overline{g}$ and $\underline{g}$ are non-idling Markov stationary policies such that (4.2) holds.

The results concerning the policy $\alpha$ are now summarized.

Theorem 4.1. Assume (R1)-(R6) to hold with $\rho < 1$ and let the integer exponent $\gamma$ in (R4) and $\delta$ in (R5) satisfy the condition

$$2\delta + 3 \le \gamma. \tag{4.8}$$

If the mapping $\eta \to J(f^\eta)$ is strictly monotone, then

$$\lim_n \eta(n) = \eta^* \qquad P^\alpha\text{-a.s.} \tag{4.9}$$
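The projected Robbins-Monro recursion (4.6), whose convergence is asserted in (4.9), can be sketched as follows. This is only an illustration under stated assumptions: the step sizes $a_n = 1/n$ are one common choice (conditions (2.8) are not restated in this section), `observe_cost` is a hypothetical callback standing in for "run the system for one slot under $\alpha_n$ and read off $c(X(n+1))$", and the noise-free surrogate cost $1 + 2\eta$ is invented for the example.

```python
def project_unit(x):
    """Projection [x]_0^1 = 0 v (x ^ 1) onto the unit interval, as in (4.6)."""
    return max(0.0, min(x, 1.0))


def robbins_monro_step(eta, step, target, observed_cost):
    """One update eta(n+1) = [eta(n) + a_{n+1} * (V - c(X(n+1)))]_0^1."""
    return project_unit(eta + step * (target - observed_cost))


def run_recursion(observe_cost, target, eta0=0.0, n_steps=50):
    """Iterate (4.6) with the assumed step sizes a_n = 1/n."""
    eta = eta0
    for n in range(n_steps):
        step = 1.0 / (n + 1)  # a_{n+1}
        eta = robbins_monro_step(eta, step, target, observe_cost(eta, n))
    return eta


# Noise-free surrogate: the observed cost is 1 + 2*eta, increasing in eta,
# so the target V = 2 corresponds to the bias eta* = 0.5.
eta_final = run_recursion(lambda eta, n: 1.0 + 2.0 * eta, target=2.0)
print(eta_final)
```

In the actual scheme the observed cost $c(X(n+1))$ is a noisy, state-dependent quantity rather than a function of $\eta$ alone, which is precisely why the convergence analysis calls for the ODE method and the Poisson equation estimates developed later.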


Under these conditions, the system also satisfies a Certainty Equivalence Principle [26] which takes the following form.

Theorem 4.2. Assume (R1)-(R6) to hold with $\rho < 1$ and let the integer exponent $\gamma$ in (R4) and $\delta$ in (R5) satisfy the condition

$$\max\{3,\ 1 + \delta(1 + \varepsilon)\} \le \gamma \tag{4.10}$$

for some $\varepsilon > 0$. If $\lim_n \eta(n) = \eta^*$ in probability under $P^\alpha$, then the convergence

$$\lim_t \frac{1}{t+1} \sum_{s=0}^{t} c(X(s)) = J(g) \tag{4.11}$$

takes place in $L^1(\Omega, \mathcal{F}, P^\alpha)$, so that

$$J(\alpha) = \lim_t E^\alpha\left[\frac{1}{t+1} \sum_{s=0}^{t} c(X(s))\right] = J(g). \tag{4.12}$$

Moreover, for any other mapping $d : \mathbb{N}^K \to \mathbb{R}$, if there exist an integer $\delta' \ge 0$ and a constant $L' > 0$ such that

$$|d(x)| \le L'(1 + |x|^{\delta'}), \qquad x \in \mathbb{N}^K \tag{4.13}$$

then both (4.11) and (4.12) hold for the long-run average cost (4.1) associated with $d$ provided the condition

$$\max\{3,\ 1 + \delta'(1 + \varepsilon')\} \le \gamma \tag{4.14}$$

holds for some $\varepsilon' > 0$.

The restriction that $\delta$ and $\delta'$ be integers is not essential but results in some simplifications in the notation. An example where the hypotheses of Theorems 4.1-4.2 hold is given in Section 8.

This section closes with a few facts which are easily derived from the enforced assumptions: Under (R6), the policies $f^\eta$, $0 \le \eta \le 1$, and $\alpha$ are all non-idling since $\underline{g}$ and $\overline{g}$ are non-idling. Moreover, note from (3.2) that

$$X_k(n+1) \le X_k(n) + A_k(n) + 1, \qquad 1 \le k \le K,\quad n = 0, 1, \ldots \tag{4.15}$$

and therefore, by virtue of (R4), $E^\pi[|X(n)|^\gamma] < \infty$ for all $n = 0, 1, \ldots$ under any policy $\pi$ in $\mathcal{P}$. Since $\delta < \gamma$ under either (4.8) or (4.10), it is then immediate from (R5) that

$$E^\pi[|c(X(n))|] \le L\bigl(1 + E^\pi[|X(n)|^\delta]\bigr) < \infty \tag{4.16}$$

and therefore $J(\pi)$ is always well defined (and in fact finite by Theorem 5.1 below). A similar argument shows that under the conditions (4.13) and (4.14), the long-run average cost associated with $d$ is also well defined and finite under any policy $\pi$ in $\mathcal{P}$.

5. MOMENT ESTIMATES

5.1. The bounds


The proofs of Theorems 4.1 and 4.2 require that bounds on moments of the rvs $\{|X(n)|,\ n = 0, 1, \ldots\}$ be available which are uniform over the class of all non-idling policies $\pi$ in $\mathcal{P}$. The derivation of such bounds is given below, and is based on the key observation that the total number of customers in the system at any given time $n$ decreases by at most one unit in the next time slot $[n, n+1)$, and is therefore bounded above by the number of slots it takes for the queue sizes to empty for the first time after $n$. This simple fact can be used to advantage when combined with the detailed statistical information obtained by the authors in [19] on the time until the system empties, and leads to the following strong estimates.

Theorem 5.1. Assume (R1)-(R5) with $\rho < 1$. There exists a single positive constant $K_\gamma$ such that for every non-idling policy $\pi$ in $\mathcal{P}$, the moment estimate

$$\sup_n E^\pi[|X(n)|^{\gamma - 1}] \le K_\gamma < \infty \tag{5.1}$$

holds true.

Theorem 5.1, the proof of which is presented below, turns out to be a special case of an intermediate result of independent interest given in Theorem 5.4. Before discussing this more general result, it is convenient to notice the following simple and useful consequence of (5.1).

Corollary 5.2. Assume (R1)-(R5) with $\rho < 1$. Whenever $\gamma > 2$, the rvs $\{|X(n)|,\ n = 0, 1, \ldots\}$ are uniformly integrable under the probability measure $P^\pi$ associated with any non-idling policy $\pi$ in $\mathcal{P}$.

5.2.  Recurrence  properties 

To formalize the argument outlined earlier, it is necessary to study the recurrence structure of the process $\{X(n),\ n = 0, 1, \ldots\}$ under any non-idling policy $\pi$ in $\mathcal{P}$. To that end consider the rvs $\{\tau_k,\ k = 0, 1, 2, \ldots\}$ and $\{\sigma_k,\ k = 1, 2, \ldots\}$ defined recursively by $\tau_0 = \sigma_1 := 0$, and

$$\tau_{k+1} := \inf\{n > \sigma_{k+1} : X(n) = 0\}, \qquad k = 0, 1, \ldots \tag{5.2a}$$

and

$$\sigma_{k+1} := \inf\{n > \tau_k : X(n) \ne 0\}, \qquad k = 1, 2, \ldots \tag{5.2b}$$

with the convention that $\tau_{k+1} = \infty$ (resp. $\sigma_{k+1} = \infty$) whenever the defining set is empty or when $\sigma_{k+1} = \infty$ (resp. $\tau_k = \infty$). Note that these definitions are different from those given in [19] (where $\nu(0)$ denotes the present rv $\tau_1$). For $k = 2, 3, \ldots$ the rv $\tau_k$ is the time epoch at which the system empties itself for the $(k-1)$st time after $\tau_1$, so that $\sigma_{k+1}$ is the time epoch when the system becomes non-empty again for the first time after $\tau_k$. Moreover, define the rvs $\{\theta_k,\ k = 1, 2, \ldots\}$ by

$$\theta_{k+1} = \tau_{k+1} - \tau_k, \qquad k = 0, 1, \ldots \tag{5.3}$$

(with the convention $\infty - \infty = 0$) so that $\theta_1 = \tau_1$. The following proposition summarizes results which were obtained in Sections 4-5 of [19].

Proposition 5.3. Assume (R1)-(R4) with $\rho < 1$. Under any non-idling policy $\pi$ in $\mathcal{P}$, the rvs $\{\theta_k,\ k = 1, 2, \ldots\}$ form a delayed renewal process whose statistics are independent of the policy $\pi$, with finite means given by

$$E^\pi[\theta_k \mid X(0) = x] = \begin{cases} \dfrac{1}{1-\rho}\, x'(I - P)^{-1}\nu + \dfrac{1}{1-\rho}\, 1[x = 0] & \text{if } k = 1 \\[2ex] \dfrac{1}{1 - q(0)} \cdot \dfrac{1}{1-\rho} & \text{if } k = 2, 3, \ldots \end{cases} \tag{5.4}$$

for all $x$ in $\mathbb{N}^K$. Moreover, the rv $\theta_2$ possesses finite moments of order $\gamma$, and for every integer $\ell$, $1 \le \ell \le \gamma$, there exists a positive constant $C_\ell$ (independent of the policy $\pi$) such that

$$E^\pi[\tau_1^\ell \mid X(0) = x] \le C_\ell(1 + |x|^\ell), \qquad x \in \mathbb{N}^K. \tag{5.5}$$

In view of this result, it is natural to introduce $\mathcal{E}_x$ as the expectation operator with respect to the distribution of $\tau_1$ given that $X(0) = x$ and that any non-idling policy is used. Finally, for reference, denote by $G(\cdot)$ the distribution of the rv $\tau_1$ and by $F(\cdot)$ the common distribution of the i.i.d. rvs $\{\theta_k,\ k = 2, 3, \ldots\}$. By definition, the distributions $G(\cdot)$ and $F(\cdot)$ do not coincide.

5.3.  A  renewal  estimate 

The (continuous-time) counting process $\{N(t),\ t \ge 0\}$ naturally associated with the sequence $\{\tau_n,\ n = 0, 1, \ldots\}$ is defined by

$$N(t) := \max\{k \ge 0 : \tau_k \le t\}, \qquad t \ge 0 \tag{5.6}$$

with the ready interpretation that $N(t)$ represents the number of times the queue has returned to the empty state by time $t$. With this notation, the observation made earlier translates into

$$|X(n)| \le \tau_{N(n)+1} - n, \qquad n = 0, 1, \ldots \tag{5.7}$$

Now, for any monotone non-decreasing mapping $r : \mathbb{R}_+ \to \mathbb{R}_+$, set

$$R_G(t) := E_\Xi^\pi[r(\tau_{N(t)+1} - t)], \qquad t \ge 0. \tag{5.8}$$

The subscripts $G$ and $\Xi$ in (5.8) emphasize the fact that the system is started with an initial queue size $\Xi$ distributed according to the distribution $q_\Xi(\cdot)$. Since the sequence $\{\theta_k,\ k = 2, 3, \ldots\}$ is a non-delayed renewal sequence, it is appropriate to define

$$R_F(t) := E_\Xi^\pi[r(\tau_{N(t + \tau_1)+1} - (t + \tau_1))], \qquad t \ge 0 \tag{5.9}$$

as this corresponds to a non-delayed renewal process with $G = F$.

The first part of this section is devoted to deriving a bound on the expected values $\{R_G(t),\ t \ge 0\}$ for any non-idling policy $\pi$, with a view towards generating (via (5.7)) a bound for the sequence of expected values $\{E^\pi[r(|X(n)|)],\ n = 0, 1, \ldots\}$.
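The pathwise bound (5.7) can be checked mechanically on a finite sample path of total queue sizes. The sketch below follows definitions (5.2) and (5.6); the function names and the two sample paths are assumptions made for illustration. The bound is guaranteed only for paths along which the total decreases by at most one per slot, which is the property of the model that the argument relies on.

```python
def emptying_epochs(totals):
    """Return [tau_0, tau_1, ...] as in (5.2) for a finite path of totals |X(n)|."""
    taus = [0]   # tau_0 := 0
    sigma = 0    # sigma_1 := 0
    horizon = len(totals)
    while True:
        # tau_{k+1}: first n > sigma_{k+1} at which the system is empty.
        tau = next((n for n in range(sigma + 1, horizon) if totals[n] == 0), None)
        if tau is None:
            break
        taus.append(tau)
        # sigma_{k+2}: first n > tau_{k+1} at which the system is non-empty.
        sigma = next((n for n in range(tau + 1, horizon) if totals[n] != 0), None)
        if sigma is None:
            break
    return taus


def forward_recurrence_bound_holds(totals):
    """Check |X(n)| <= tau_{N(n)+1} - n for every n before the last observed epoch."""
    taus = emptying_epochs(totals)
    for n in range(taus[-1]):
        k = max(i for i, tau in enumerate(taus) if tau <= n)  # N(n) of (5.6)
        if totals[n] > taus[k + 1] - n:
            return False
    return True


# A path whose total never drops by more than one customer per slot.
path = [2, 3, 2, 1, 0, 0, 1, 0]
epochs = emptying_epochs(path)
ok = forward_recurrence_bound_holds(path)
# A path that drops by three in one slot violates the premise, and the bound fails.
bad = forward_recurrence_bound_holds([3, 0])
print(epochs, ok, bad)
```

Note that, in line with (5.2), not every slot at which the system is empty is an emptying epoch: consecutive empty slots belong to the same idle period, and the next epoch is counted only after the system has become non-empty again.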

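Theorem 5.4. Assume (R1)-(R4) with $\rho < 1$ and let $\pi$ be an arbitrary non-idling policy in $\mathcal{P}$. Under the finite moment assumptions

$$m_G(r) := \int_0^\infty r(\theta)\, dG(\theta) < \infty \quad \text{and} \quad m_F(r) := \int_0^\infty r(\theta)\, dF(\theta) < \infty, \tag{5.10}$$

the condition

$$K_F(r) := \int_0^\infty \int_0^\theta r(\theta - t)\, dt\, dF(\theta) = \int_0^\infty \int_t^\infty r(\theta - t)\, dF(\theta)\, dt < \infty \tag{5.11}$$

implies

$$\sup_{t \ge 0} R_G(t) = \sup_{t \ge 0} E_\Xi^\pi[r(\tau_{N(t)+1} - t)] < \infty. \tag{5.12}$$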


Proof: Let $r_G$ and $r_F$ be the mappings $\mathbb{R}_+ \to \mathbb{R}_+$ defined by

$$r_G(t) := \int_t^\infty r(\theta - t)\, dG(\theta) \quad \text{and} \quad r_F(t) := \int_t^\infty r(\theta - t)\, dF(\theta), \qquad t \ge 0. \tag{5.13}$$

The finiteness conditions (5.10) translate into $r_G(0) = m_G(r) < \infty$ and $r_F(0) = m_F(r) < \infty$. Since $r$ takes positive values and is monotone non-decreasing, the indefinite integrals entering the definition (5.13) are well defined, and satisfy the inequalities

$$0 \le \int_t^\infty r(\theta - t)\, dG(\theta) \le \int_t^\infty r(\theta - s)\, dG(\theta) \le \int_s^\infty r(\theta - s)\, dG(\theta) \tag{5.14}$$

whenever $0 \le s \le t$. As a result, the mapping $r_G$ is well defined and monotone non-increasing. Similar comments hold for $r_F$.

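A standard renewal argument [12, p. 183] applied to the process $\{r(\tau_{N(t)+1} - t),\ t \ge 0\}$ shows that

$$R_G(t) = \int_0^t R_F(t - \theta)\, dG(\theta) + \int_t^\infty r(\theta - t)\, dG(\theta), \qquad t \ge 0 \tag{5.15}$$

whence

$$\begin{aligned}
R_G(t) &\le \int_0^t R_F(t - \theta)\, dG(\theta) + \int_0^\infty r(\theta)\, dG(\theta) \\
&\le \sup_{0 \le s \le t} R_F(s) + m_G(r), \qquad t \ge 0
\end{aligned} \tag{5.16}$$

by the remarks made earlier. This clearly shows that under (5.10), the result (5.12) will hold if the bound

$$\sup_{t \ge 0} R_F(t) < \infty \tag{5.17}$$

can be established.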

When $G = F$, the renewal equation (5.15) specializes to

$$R_F(t) = r_F(t) + \int_0^t R_F(t - \theta)\, dF(\theta), \qquad t \ge 0. \tag{5.18}$$

The mapping $r_F$ is monotone non-increasing, takes non-negative values and is integrable as a result of (5.11); it is therefore directly Riemann integrable [12, pp. 190-191]. The fact that $0 \le r_F(t) \le m_F(r)$ for all $t \ge 0$ implies that $R_F$ is bounded on finite intervals [12, Thm. 4.2, p. 184]. Finally, note that the distribution $F(\cdot)$ has support on $\mathbb{N}$ and is therefore arithmetic, say with span $d$. All requisite conditions are now in place to apply the Basic Renewal Theorem [12, Thm. 5.5.1, p. 191] to the renewal equation (5.18) to obtain

$$\lim_n R_F(c + nd) = \frac{d}{m_F} \sum_{n=0}^{\infty} r_F(c + nd), \qquad c > 0 \tag{5.19}$$

where $m_F$ is the first moment of $F$ (which is finite by Proposition 5.3). Since the mapping $r_F$ is non-increasing, it readily follows from (5.19) that for all $c > 0$,

$$\begin{aligned}
\lim_n R_F(c + nd) &\le \frac{1}{m_F} \left\{ d\, r_F(0) + \sum_{n=1}^{\infty} d\, r_F(nd) \right\} \\
&\le \frac{1}{m_F} \left\{ d\, r_F(0) + \int_0^\infty r_F(t)\, dt \right\} \\
&= \frac{1}{m_F} \left\{ d\, m_F(r) + K_F(r) \right\} < \infty
\end{aligned} \tag{5.20}$$

where the finiteness of the last bound results from (5.10)-(5.11). In particular,

$$\lim_n R_F(nd + \ell) \le \frac{1}{m_F} \left\{ d\, m_F(r) + K_F(r) \right\} < \infty, \qquad \ell = 1, 2, \ldots, d \tag{5.21}$$

and therefore

$$\sup_n R_F(n) < \infty. \tag{5.22}$$

Since $N(t)$ is constant on $[n, n+1)$, direct inspection of (5.9) shows that $R_F(t) \le R_F(n)$ whenever $n \le t < n + 1$ owing to the monotonicity of $r$, whence $\sup_{t \ge 0} R_F(t) = \sup_n R_F(n)$ and (5.17) is now immediate from (5.22). ■

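Proof of Theorem 5.1: Start with the mapping $r$ given by $r(x) = x^{\gamma - 1}$ for all $x \ge 0$, and observe that

$$K_F(r) = \int_0^\infty \int_0^\theta r(\theta - t)\, dt\, dF(\theta) = \frac{1}{\gamma} \int_0^\infty \theta^\gamma\, dF(\theta). \tag{5.23}$$

Under (R4), Proposition 5.3 and (5.23) imply the conditions (5.10)-(5.11). Since (5.7) and the monotonicity of $r$ give $E^\pi[|X(n)|^{\gamma - 1}] \le R_G(n)$ for all $n = 0, 1, \ldots$, a straightforward application of Theorem 5.4 yields (5.1). ■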
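For readers who wish to see the integral in (5.23) spelled out, the inner integration can be carried out explicitly. The display below is only a worked restatement under the choice $r(x) = x^{\gamma - 1}$ made in the proof; it adds no new assumption.

```latex
\int_0^{\theta} (\theta - t)^{\gamma - 1}\, dt
  = \left[ -\frac{(\theta - t)^{\gamma}}{\gamma} \right]_{t = 0}^{t = \theta}
  = \frac{\theta^{\gamma}}{\gamma},
\qquad \text{so that} \qquad
K_F(r) = \int_0^{\infty} \frac{\theta^{\gamma}}{\gamma}\, dF(\theta)
       = \frac{1}{\gamma} \int_0^{\infty} \theta^{\gamma}\, dF(\theta).
```

The right-hand side is the $\gamma$th moment of $\theta_2$ divided by $\gamma$, which is finite because $\theta_2$ possesses finite moments of order $\gamma$ by Proposition 5.3.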

5.4.  Extensions 

Bounds of the form (5.1) are related to the stability of the system under the class of policies of interest, and are typically established through system-specific arguments. In fact, as will become apparent from the discussion in Sections 6-7, (5.1) need only hold for a small number of policies. The methods of the present section can be extended to the general model in the following way. Let $\Pi$ denote a class of policies under which (5.1) is sought to hold, and consider the conditions (G1)-(G3) below, where

(G1): There exists a positive constant $c_s$ such that for every policy $\pi$ in $\Pi$,

$$P^\pi[|X(n+1)| < |X(n)| - c_s] = 0, \qquad n = 1, 2, \ldots \tag{5.24}$$

Define the rvs $\{\tau_k, \sigma_k, \theta_k,\ k = 1, 2, \ldots\}$ and $\{N(t),\ t \ge 0\}$ as in Section 5.2, and assume all these rvs to be finite a.s. under each policy $\pi$ in $\Pi$. Under (G1), it follows that for every $\pi$ in $\Pi$,

$$|X(n)| \le c_s(\tau_{N(n)+1} - n) \qquad P^\pi\text{-a.s.}, \quad n = 1, 2, \ldots \tag{5.25}$$

Note that in this general context, $\{N(t),\ t \ge 0\}$ may not be a renewal process, since $\pi$ need not be a stationary policy. Let $\ell$ be a positive integer, and introduce conditions (G2)-(G3) as

(G2): There exists a stationary policy $\pi^*$ in $\Pi$ such that for every policy $\pi$ in $\Pi$ either

$$E^\pi[|X(n)|^{\ell - 1} \mid X(0) = x] \le E^{\pi^*}[|X(n)|^{\ell - 1} \mid X(0) = x],$$

or

$$E^\pi[|\tau_{N(n)+1} - n|^{\ell - 1} \mid X(0) = x] \le E^{\pi^*}[|\tau_{N(n)+1} - n|^{\ell - 1} \mid X(0) = x],$$

for all $x \in \mathbb{N}^K$ and $n = 1, 2, \ldots$; and

(G3): For every initial condition $x$ in $\mathbb{N}^K$, there exists a positive constant $C_\ell(x)$ such that

$$E^{\pi^*}[\tau_1^\ell \mid X(0) = x] \le C_\ell(x).$$

Note that (G1)-(G2) together imply via (5.25) that for every policy $\pi$ in $\Pi$,

$$E^\pi[|X(n)|^{\ell - 1}] \le c_s^{\ell - 1} E^{\pi^*}[|\tau_{N(n)+1} - n|^{\ell - 1}], \qquad n = 1, 2, \ldots \tag{5.26}$$

Assumptions (G1)-(G3) are natural in queueing systems when conservation laws are available. Note that $|\cdot|$ could denote any norm on $\mathbb{R}^K$, a fact which could be used to advantage when employing conservation laws. The generalization of Theorem 5.1 can now be stated.

Theorem 5.1bis. Consider the general model. Under (G1)-(G3), there exists a single constant $K_\gamma$ (with $\gamma = \ell$) such that (5.1) holds for every policy $\pi$ in $\Pi$.

Proof: Under the Markov stationary policy $\pi^*$, the process $\{N(t),\ t \ge 0\}$ is a delayed renewal process. The proof of (5.1) with $\pi = \pi^*$ is identical to the proof of Theorem 5.1 and the desired result now follows from (G2). ■

6.  ON  THE  POISSON  EQUATION 
6.1.  The  Poisson  equation 

Fix $\eta$ in the unit interval $[0, 1]$ and denote by $P^\eta$ (resp. $E^\eta$) the probability measure (resp. expectation operator) induced by the policy $f^\eta$. Moreover, let $P_x^\eta$ (resp. $E_x^\eta$) denote the (conditional) probability measure (resp. expectation operator) induced by the policy $f^\eta$ given that $X(0) = x$, with $x$ ranging in $\mathbb{N}^K$.

Recall that under $P^\eta$, the rvs $\{X(n),\ n = 0, 1, \ldots\}$ form a time-homogeneous Markov chain over $\mathbb{N}^K$, and let $(P^\eta(x, y))$ denote the corresponding one-step transition probabilities. It is plain from (4.3) that

$$P^\eta(x, y) = \eta P^1(x, y) + (1 - \eta) P^0(x, y), \qquad x, y \in \mathbb{N}^K \tag{6.1}$$

where $(P^1(x, y))$ (resp. $(P^0(x, y))$) are the one-step transition probabilities under $\overline{g}$ (resp. $\underline{g}$).

The mapping $h : \mathbb{N}^K \to \mathbb{R}$ and the scalar $J$ solve the Poisson equation (associated with the policy $f^\eta$) with forcing function $c : \mathbb{N}^K \to \mathbb{R}$ if

$$h(x) + J = c(x) + \sum_{y} P^\eta(x, y) h(y), \qquad x \in \mathbb{N}^K. \tag{6.2}$$

Clearly the solution pair $(h, J)$ to (6.2) depends on $\eta$, and it is the purpose of this section to establish its regularity properties with respect to $\eta$. This information is essential both for establishing the validity of the Certainty Equivalence Principle [20, 26] and for studying the convergence of the stochastic approximation algorithm (4.6) by the method of Metivier and Priouret [21]. From now on, this dependence of $J$ and $h(x)$ on the bias $\eta$ is denoted simply by $J(\eta)$ and $h(\eta, x)$ for all $x$ in $\mathbb{N}^K$.

Define the first return time to state $x = 0$ as the $\mathcal{F}_n$-stopping time $T$ given by

$$T := \inf\{n > 0 : X(n) = 0\} \tag{6.3}$$

so that $T = \tau_1$ in the notation of Section 5. Set

$$T_\ell(x) := \mathcal{E}_x[T^\ell] = E_x^\eta[T^\ell], \qquad x \in \mathbb{N}^K,\quad \ell = 1, \ldots, \gamma \tag{6.4}$$

where the notation that follows Proposition 5.3 has been used. For easy reference recall the estimate (5.5), valid under (R1)-(R4), i.e., for each $\ell = 1, \ldots, \gamma$, there exists a positive constant $C_\ell$ so that

$$T_\ell(x) \le C_\ell(1 + |x|^\ell), \qquad x \in \mathbb{N}^K. \tag{6.5}$$


As pointed out already in Section 5, during each slot, at most one customer may leave the system, so that for each $t = 0, 1, \ldots$, $|X(t)|$ is necessarily no larger than the forward recurrence time (expressed in slots) to the empty state, and in particular $|X(0)| \le T$. Since the mapping $x \to c(|x|)$ defined in (R5) is a non-decreasing function of $|x|$, it is plain from (6.5) that whenever $\delta + 1 \le \gamma$, the bounds

$$E_x^\eta\left[\sum_{t=0}^{T-1} |c(X(t))|\right] \le E_x^\eta\left[\sum_{t=0}^{T-1} c(|X(t)|)\right] \le \mathcal{E}_x[T c(T)] < \infty, \qquad x \in \mathbb{N}^K \tag{6.6}$$

hold, and the definition

$$C(\eta, x) := E_x^\eta\left[\sum_{t=0}^{T-1} c(X(t))\right], \qquad x \in \mathbb{N}^K \tag{6.7}$$

is thus well posed. An explicit expression for a solution to the Poisson equation is available and is now given [9, 27].

Theorem 6.1. Assume conditions (R1)-(R6) to hold with $\rho < 1$ and $\delta + 1 \le \gamma$. A solution pair $(h(\eta, \cdot), J(\eta))$ to the Poisson equation (6.2) with $h(\eta, 0) = 0$ is given by

$$J(\eta) = \frac{C(\eta, 0)}{T_1(0)} \quad \text{and} \quad h(\eta, x) = C(\eta, x) - J(\eta) T_1(x), \qquad x \in \mathbb{N}^K \tag{6.8a}$$

and the equality

$$J(f^\eta) = \lim_n E^\eta\left[\frac{1}{n+1} \sum_{t=0}^{n} c(X(t))\right] = J(\eta) \tag{6.8b}$$

holds true.

In view of (6.8b) and of the ergodic properties of this system under $f^\eta$, it is plain that $J(\eta)$ is also the expectation of $c(X)$ under the invariant measure corresponding to the policy $f^\eta$.
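The probabilistic representation of Theorem 6.1 can be verified numerically on a toy example. The sketch below uses a hypothetical three-state Markov chain on $\{0, 1, 2\}$ (not the queueing model of the paper) and an invented cost vector; it computes the excursion sums by first-step analysis and checks that $h = C - J\,T_1$ solves the Poisson equation (6.2) exactly. The helper names `solve` and `excursion_sums` are assumptions made for this illustration.

```python
from fractions import Fraction as Fr


def solve(a, b):
    """Solve a u = b exactly by Gauss-Jordan elimination (a assumed invertible)."""
    n = len(b)
    m = [[Fr(v) for v in row] + [Fr(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def excursion_sums(P, f):
    """Return E_x[sum_{t<T} f(X(t))] for every x, with T the first return time to 0."""
    inner = range(1, len(P))
    # First-step analysis on the non-zero states: (I - Q) u = f.
    a = [[(1 if x == y else 0) - P[x][y] for y in inner] for x in inner]
    u = solve(a, [f[x] for x in inner])
    u0 = f[0] + sum(P[0][y] * u[y - 1] for y in inner)
    return [u0] + u


half = Fr(1, 2)
P = [[half, half, 0], [half, 0, half], [0, half, half]]  # toy transition matrix
c = [0, 1, 4]                                            # toy cost vector

C = excursion_sums(P, c)           # C(x) of (6.7)
T1 = excursion_sums(P, [1, 1, 1])  # T_1(x) = E_x[T]
J = C[0] / T1[0]
h = [C[x] - J * T1[x] for x in range(3)]
residual = [h[x] + J - c[x] - sum(P[x][y] * h[y] for y in range(3))
            for x in range(3)]
print(J, h, residual)
```

The toy chain is doubly stochastic, so its invariant measure is uniform and the average cost is the plain mean of the cost vector; the value of `J` computed from the excursion representation agrees with this, in line with the remark that $J(\eta)$ is the expectation of $c(X)$ under the invariant measure.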

6.2.  Lipschitz  continuity 

The representation (6.8) will be put to use in studying the regularity of the solution pair to the Poisson equation (6.2). To simplify the presentation of the main result of this section, set

$$K(x) := \mathcal{E}_x[T^2 c(T)], \qquad x \in \mathbb{N}^K. \tag{6.9}$$

Theorem 6.2. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 2 \le \gamma$. Then for all $x$ in $\mathbb{N}^K$, $K(x) < \infty$ and the function $\eta \to C(\eta, x)$ is Lipschitz continuous on $[0, 1]$ with Lipschitz constant $4K(x)$, i.e.,

$$|C(\eta, x) - C(\eta', x)| \le 4K(x)\, |\eta - \eta'|, \qquad \eta, \eta' \in [0, 1]. \tag{6.10}$$

Proof: Fix $x$ in $\mathbb{N}^K$. That $K(x)$ and $\mathcal{E}_x[T c(T)]$ are both finite is plain from (6.5) under the assumption $\delta + 2 \le \gamma$. The result (6.10) is established below for $c$ non-negative in the form

$$|C(\eta, x) - C(\eta', x)| \le 2K(x)\, |\eta - \eta'|, \qquad \eta, \eta' \in [0, 1] \tag{6.11}$$

so that the result for a general $c$ is then immediate upon decomposing $c$ into its positive and negative parts. Therefore, it suffices to assume $c$ to be non-negative in the remainder of this proof. The arguments proceed in three steps.

Step 1: Fix $\eta$ in $[0, 1]$. Notice that for every $\mathbb{N}^K$-valued sequence $\{x(i),\ i = 0, 1, \ldots\}$ with $x(0) = x$, the relations

$$P_x^\eta[X(i) = x(i),\ 1 \le i \le m] = \prod_{i=0}^{m-1} P^\eta(x(i), x(i+1)), \qquad m = 1, 2, \ldots \tag{6.12}$$

hold as a result of the Markov property of the chain $\{X(n),\ n = 0, 1, \ldots\}$ under $P^\eta$. The product form of (6.12) and the linear structure of (6.1) now imply that for each $m = 1, 2, \ldots$, the mapping $\eta \to P_x^\eta[X(i) = x(i),\ 1 \le i \le m]$ is a polynomial of degree at most $m$ in $\eta$ over $[0, 1]$ and has derivatives of all orders.

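Set $A = [X(i) = x(i),\ 1 \le i \le m]$ in (6.12) and observe that

$$\frac{d}{d\eta} P_x^\eta[A] = \sum_{t=0}^{m-1} \bigl[P^1(x(t), x(t+1)) - P^0(x(t), x(t+1))\bigr] \prod_{i=0,\, i \ne t}^{m-1} P^\eta(x(i), x(i+1)). \tag{6.13}$$

This suggests defining for every $t = 0, 1, \ldots$, the policy $0_t$ (resp. $1_t$) as the Markov policy that operates according to $f^0$ (resp. $f^1$) at time $t$, and according to $f^\eta$ otherwise. With this notation, (6.13) now takes the form

$$\frac{d}{d\eta} P_x^\eta[X(i) = x(i),\ 1 \le i \le m] = \sum_{t=0}^{m-1} \bigl(P_x^{1_t}[A] - P_x^{0_t}[A]\bigr). \tag{6.14}$$

The definition of the policies $0_t$ and $1_t$ implies that $P_x^{1_t}[A] = P_x^{0_t}[A]$ whenever $m \le t$, so that (6.14) can also be rewritten as

$$\frac{d}{d\eta} P_x^\eta[X(i) = x(i),\ 1 \le i \le m] = \sum_{t=0}^{n-1} \bigl(P_x^{1_t}[A] - P_x^{0_t}[A]\bigr), \qquad m \le n. \tag{6.15}$$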
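The product-rule identity (6.13) behind Step 1 can be checked exactly on a single path. In the sketch below, `p1` and `p0` are hypothetical one-step transition probabilities along one fixed path under $\overline{g}$ and $\underline{g}$ respectively (invented numbers, not model data); the path probability is expanded as a polynomial in $\eta$, differentiated coefficient by coefficient, and compared with the right-hand side of (6.13).

```python
from fractions import Fraction as Fr


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fr(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_eval(p, eta):
    """Evaluate a coefficient list at eta."""
    return sum(coef * eta ** k for k, coef in enumerate(p))


def path_probability_poly(p1_steps, p0_steps):
    """Coefficients in eta of prod_i [eta * P1_i + (1 - eta) * P0_i], cf. (6.1), (6.12)."""
    poly = [Fr(1)]
    for a, b in zip(p1_steps, p0_steps):
        poly = poly_mul(poly, [Fr(b), Fr(a) - Fr(b)])  # b + (a - b) * eta
    return poly


def derivative_by_formula(p1_steps, p0_steps, eta):
    """Right-hand side of (6.13): sum_t (P1_t - P0_t) * prod_{i != t} P^eta_i."""
    total = Fr(0)
    for t in range(len(p1_steps)):
        term = Fr(p1_steps[t]) - Fr(p0_steps[t])
        for i in range(len(p1_steps)):
            if i != t:
                term *= eta * Fr(p1_steps[i]) + (1 - eta) * Fr(p0_steps[i])
        total += term
    return total


p1 = [Fr(1, 2), Fr(1, 3), Fr(3, 4)]  # hypothetical steps under the upper policy
p0 = [Fr(1, 4), Fr(2, 3), Fr(1, 2)]  # hypothetical steps under the lower policy
poly = path_probability_poly(p1, p0)
dpoly = [k * coef for k, coef in enumerate(poly)][1:]  # termwise derivative
eta = Fr(2, 5)
lhs = poly_eval(dpoly, eta)
rhs = derivative_by_formula(p1, p0, eta)
print(lhs, rhs)
```

Each summand on the right-hand side replaces the mixed transition at one time $t$ by the difference of the two constituent transitions, which is what the policies $0_t$ and $1_t$ encode in (6.14).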


Step 2: To proceed, define

$$C_m(\eta, x) := E_x^\eta\left[1[T \le m] \sum_{t=0}^{T \wedge m - 1} c(X(t))\right] = E_x^\eta\left[\sum_{k=1}^{m} 1[T = k] \sum_{t=0}^{k-1} c(X(t))\right] \tag{6.16}$$

for all $m = 1, 2, \ldots$. The definition of $T$ implies that

$$[T = k] = [X(t) \ne 0,\ 0 < t < k,\ X(k) = 0], \qquad k = 1, 2, \ldots \tag{6.17}$$

so that

$$C_m(\eta, x) = \sum_{k=1}^{m} \sum_{(x(1), \ldots, x(k)) \in Z_k} P_x^\eta[X(i) = x(i),\ 1 \le i \le k] \sum_{t=0}^{k-1} c(x(t)) \tag{6.18}$$

where the second sum is taken over the set $Z_k$ given by

$$Z_k := \{(x(1), x(2), \ldots, x(k)) \in (\mathbb{N}^K)^k : x(i) \ne 0,\ 1 \le i < k \text{ and } x(k) = 0\}, \qquad k = 1, 2, \ldots \tag{6.19}$$

By arguments made earlier, it is plain that on the event $[T = k]$, the bounds $|X(t)| \le k$, $0 \le t < k$, must necessarily hold, and therefore (6.18) reduces to

$$C_m(\eta, x) = \sum_{k=1}^{m} \sum_{(x(1), \ldots, x(k)) \in Z'_k} P_x^\eta[X(i) = x(i),\ 1 \le i \le k] \sum_{t=0}^{k-1} c(x(t)) \tag{6.20}$$

where the finite set $Z'_k$ is given by

$$Z'_k := \{(x(1), x(2), \ldots, x(k)) \in Z_k : |x(i)| \le k,\ 1 \le i \le k\}, \qquad k = 1, 2, \ldots \tag{6.21}$$

Hence, in view of remarks made earlier in the proof, the mapping $\eta \to C_m(\eta, x)$ is a polynomial of degree at most $m$ in $\eta$ since it is the sum of a finite number of polynomial functions, each one of degree no greater than $m$.

Since $C_m(\eta, x)$ is a polynomial in $\eta$ for each $m = 1, 2, \ldots$, the derivative $\dot{C}_m(\eta, x)$ exists in the interval $[0, 1]$. To compute it, differentiate (6.20) and use (6.14)-(6.15) to conclude that

$$\dot{C}_m(\eta, x) = \sum_{t=0}^{m-1} \left( E_x^{1_t}\left[\sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, c(X(s))\right] - E_x^{0_t}\left[\sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, c(X(s))\right] \right). \tag{6.22}$$

The very same argument that led from (6.14) to (6.15) now implies that whenever $0 < k \le t$, the relation

$$E_x^{1_t}\left[1[T = k] \sum_{s=0}^{k-1} c(X(s))\right] = E_x^{0_t}\left[1[T = k] \sum_{s=0}^{k-1} c(X(s))\right] \tag{6.23}$$

holds. Therefore (6.22) can be rewritten (in the manner of (6.16)) as

$$\dot{C}_m(\eta, x) = \sum_{t=0}^{m-1} \left( E_x^{1_t}\left[\sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s))\right] - E_x^{0_t}\left[\sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s))\right] \right). \tag{6.24}$$

On the other hand,

$$\begin{aligned}
\sum_{t=0}^{m-1} E_x^{1_t}\left[\sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s))\right]
&\le \sum_{t=0}^{m-1} E_x^{1_t}\left[1[t < T] \sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, c(T)\right] \\
&\le \sum_{t=0}^{m-1} \mathcal{E}_x\bigl[1[t < T \le m]\, T c(T)\bigr] \\
&\le \mathcal{E}_x[T^2 c(T)]
\end{aligned} \tag{6.25}$$

by elementary calculations. A similar bound holds for the terms corresponding to the policies $0_t$ in (6.24). It then follows from (6.24) and (6.25) that the derivative $\dot{C}_m(\eta, x)$ of $C_m(\eta, x)$ is bounded on $[0, 1]$ by $2K(x)$, and this uniformly in $m$, i.e.,

$$|\dot{C}_m(\eta, x)| \le 2K(x), \qquad m = 0, 1, \ldots \tag{6.26}$$

for all $\eta$ in $[0, 1]$.

Step 3: The easy estimates

$$0 \le C(\eta, x) - C_m(\eta, x) = E_x^\eta\left[1[T > m] \sum_{t=0}^{T-1} c(X(t))\right] \le \mathcal{E}_x\bigl[1[T > m]\, T c(T)\bigr], \qquad m = 0, 1, \ldots \tag{6.27}$$

imply via the Monotone Convergence Theorem that $\lim_m C_m(\eta, x) = C(\eta, x)$ uniformly in $\eta$ since $\mathcal{E}_x[T c(T)] < \infty$. Consequently, with $0 \le \eta < \eta' \le 1$,

$$\begin{aligned}
|C(\eta, x) - C(\eta', x)| &= \lim_m |C_m(\eta, x) - C_m(\eta', x)| \\
&= \lim_m \left| \int_\eta^{\eta'} \dot{C}_m(y, x)\, dy \right| \\
&\le 2K(x)\, |\eta - \eta'|
\end{aligned} \tag{6.28}$$

upon making use of (6.26), and this establishes (6.11). ■

Note that the estimate (6.27) shows that $C(\eta, x)$ is continuous under the weaker condition $\delta + 1 \le \gamma$.

6.3.  Corollaries 

Theorem 6.2 has several useful consequences, which are given in the next few corollaries.
The first such corollary is obtained by combining Theorems 6.1 and 6.2 in a straightforward manner;
details are left to the interested reader.

Corollary 6.3. Under the hypotheses of Theorem 6.2, the functions $\eta \to J(\eta)$ and $\eta \to h(\eta,x)$,
with $x$ ranging in $\mathbb{N}^K$, are Lipschitz continuous on $[0,1]$, i.e., for all $\eta$ and $\eta'$ in $[0,1]$,

\J(l)-JW)\  (6.29) 

and 

$$|h(\eta,x) - h(\eta',x)| \le 4K_h(x)\,|\eta - \eta'|, \qquad x \in \mathbb{N}^K \tag{6.30}$$

with 

Kh{x)  :=  K{x)  +  •  Tx{x\  x  G  1N/C  .  (6.31) 

The behavior of the Lipschitz constants $K(x)$ and $K_h(x)$, and of the solution $h(\eta,x)$, for $|x|$
large is needed in some of the arguments given in Section 7. The estimates on the Lipschitz
constants are given first.

Corollary 6.4. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 2 < \gamma$. There exists a positive constant $C$
such that

$$|K(x)| \le C\,(1 + |x|^{\delta+2}), \qquad x \in \mathbb{N}^K \tag{6.32a}$$

and

$$|K_h(x)| \le C\,(1 + |x|^{\delta+2}), \qquad x \in \mathbb{N}^K. \tag{6.32b}$$

Proof: Fix $x \in \mathbb{N}^K$. Note from (R6) and (6.9) that

$$K(x) \le L\,\mathcal{E}_x\big[T^2(1 + T^{\delta})\big] \le 2L\,\mathcal{E}_x\big[T^{\delta+2}\big] \le 2L\,C_{\delta+2}\,(1 + |x|^{\delta+2}) \tag{6.33}$$

with the last inequality following from (6.5), so that (6.32a) holds whenever $C \ge 2LC_{\delta+2}$. The
inequality (6.32b) is readily obtained from (6.31) upon making use of (6.5) and (6.32a). ■
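
The middle inequality in (6.33) rests on the elementary fact that $T^2(1+T^{\delta}) \le 2T^{\delta+2}$ whenever $T \ge 1$ and $\delta \ge 0$. The short Python sketch below checks this numerically on a grid. The function name and the grid values are illustrative assumptions and are not part of the paper.

```python
def moment_bound_holds(T: int, delta: float) -> bool:
    """Check T^2 (1 + T^delta) <= 2 T^(delta + 2) for a return time T >= 1."""
    lhs = T**2 * (1 + T**delta)
    rhs = 2 * T ** (delta + 2)
    # A small relative tolerance guards against floating-point rounding.
    return lhs <= rhs * (1 + 1e-12)


# Hypothetical grid of return times and growth rates.
all_ok = all(
    moment_bound_holds(T, delta)
    for T in range(1, 200)
    for delta in (0.0, 0.5, 1.0, 2.0, 3.5)
)
```

Equality is attained at $T = 1$, which is why the factor 2 cannot be improved without further information on $T$.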

The  growth  of  solutions  to  the  Poisson  equation  can  now  be  described. 

Corollary 6.5. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 1 < \gamma$. There exists a positive constant
$B_h$ such that

$$|h(\eta,x)| \le B_h\,(1 + |x|^{\delta+1}), \qquad x \in \mathbb{N}^K \tag{6.34}$$

for every $\eta$ in $[0,1]$.

Proof: By the remark following the proof of Theorem 6.2, the mapping $\eta \to C(\eta,0)$ is continuous
on $[0,1]$ and therefore bounded there. For each $x$ in $\mathbb{N}^K$, straightforward arguments show that

"T— 1  “I 

I  h(rj,x)  \<E2  £|c(X(t))|  sup  \C(V,0)\ 

Lfeo  J  °<v<i  (6.35) 

<  €x[Tc(T)\  +  DlT1(x) 

with 

sup  lc’(7?’°)l-  (6.36) 

4i(U)  0<»?<1 

The passage from (6.35) to (6.34) is validated by the same arguments as the ones given in the proof
of Corollary 6.4. ■

Finally, a bound on the moments of the rvs $\{h(\eta(n), X(n+1)),\ n = 0,1,\ldots\}$ is obtained.

Corollary 6.6. Assume (R1)-(R6) with $\rho < 1$ and $r(\delta+1) + 1 < \gamma$ for some non-negative integer
$r$. Then there exists a positive constant $H_r$ such that the bound

$$\sup_n E^{\alpha}\big[|h(\eta(n), X(n+1))|^r\big] \le H_r \tag{6.37}$$

holds.

Proof: For every $\eta$ in $[0,1]$, Corollary 6.5 immediately implies

$$|h(\eta,x)|^r \le |2B_h|^r\,\big(1 + |x|^{r(\delta+1)}\big), \qquad x \in \mathbb{N}^K \tag{6.38}$$

so that

$$E^{\alpha}\big[|h(\eta(n), X(n+1))|^r\big] \le |2B_h|^r\,\big(1 + E^{\alpha}\big[|X(n+1)|^{r(\delta+1)}\big]\big), \qquad n = 0,1,\ldots \tag{6.39}$$

The conclusion (6.37) is now obtained from Theorem 5.1 upon selecting $H_r = |2B_h|^r(1 + K_{\gamma})$ since
$r(\delta+1) < \gamma - 1$. ■

6.4.  The  general  model 

In [27] the authors have developed a methodology for proving smoothness properties of solutions
to the Poisson equation in a fairly general setting. This is done by invoking the recurrence
properties of the underlying Markov chain in order to obtain continuity, Lipschitz continuity and
differentiability properties. The ideas of the present paper are however amenable to generalization
as follows. Suppose (as in (4.3)) that at each step the stationary policy $\bar g$ is used with some
probability $\eta$ while the stationary policy $\underline g$ is used with probability $(1-\eta)$. Then, as in (6.1), the one-step
transition probabilities take the form

$$p_{xy}(\eta) = \eta\,p_{xy}(0) + (1-\eta)\,p_{xy}(1), \qquad x, y \in \mathbb{N}^K, \tag{6.40}$$

where $p_{xy}(0)$ (resp. $p_{xy}(1)$), $x, y$ in $\mathbb{N}^K$, are the one-step transition probabilities under $\bar g$ (resp.
$\underline g$). Under these assumptions the original MDP collapses to the model where the action space is
$U = [0,1]$, and the transitions are realized according to (6.40).
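
The mixture structure (6.40) can be illustrated on a finite toy chain. In the sketch below the two 3-state kernels `P0` and `P1` are made-up placeholders (the state space of the paper is the infinite lattice $\mathbb{N}^K$), and `mixed_kernel` is a hypothetical helper name. The code only shows that every convex combination of two transition matrices is again a transition matrix, affine in $\eta$.

```python
def mixed_kernel(eta, P0, P1):
    """Return eta * P0 + (1 - eta) * P1 entrywise, the form taken in (6.40)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return [
        [eta * a + (1.0 - eta) * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(P0, P1)
    ]


# Hypothetical kernels standing in for p_xy(0) and p_xy(1).
P0 = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.0, 0.4, 0.6]]
P1 = [[0.9, 0.1, 0.0], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]

P_half = mixed_kernel(0.5, P0, P1)
row_sums = [sum(row) for row in P_half]  # each row should sum to 1
```

Because the dependence on $\eta$ is affine, derivatives with respect to $\eta$ of expectations over finitely many steps reduce to differences between the two extreme kernels, which is the mechanism exploited in Section 6.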

However, the structure (6.40) may arise through a mechanism different from the one outlined
above. Given this structure, with some abuse of notation, let $\eta$ also denote the stationary policy
which uses action $\eta$ at every stage. Fix $\eta$ in $[0,1]$ and, as in (6.13)-(6.14), let $0_t$ (resp. $1_t$) denote
the policy which uses action 0 (resp. action 1) at time $t$, and otherwise uses action $\eta$. Condition
(G4) then takes the form

(G4): The distribution of the rv $T$ under $P^{1_t}$ and $P^{0_t}$ is stochastically monotone
in $t$, i.e., for all increasing functions $r : \mathbb{R}_+ \to \mathbb{R}$, the mappings $t \to E^{1_t}[r(T)]$ and
$t \to E^{0_t}[r(T)]$ are monotone.

As the arguments in (6.41) below reveal, a condition much weaker than (G4) will suffice. Let $\Pi$
be the collection of policies $\{1_t, 0_t;\ t = 1,2,\ldots;\ 0 \le \eta \le 1\}$.

Theorem  6.2bis.  Consider  the  general  model  and  assume  conditions  (Gl),  (G2),  (G4)  and 
(G3)  for  l  =  1,2, ..  .,7  to  hold  (with  Ti  :=  T  as  in  Section  5.f).  Then  the  conclusion  of  Theorem 
6.2  holds. 

Proof: For the sake of simplicity, set $c_s = 1$ in (G1) as the extension to the case $c_s \ne 1$ is obvious.
Under the conditions (G1)-(G3), the proof is almost identical to that of Theorem 6.2, except that
(G1), (G4) and the monotonicity of $c$ are used in (6.25) to obtain

$$\begin{aligned}
\sum_{t=0}^{m-1} E_x^{1_t}\Big[\sum_{s=0}^{T\wedge m-1} 1[t<T\le m]\,c(X(s))\Big]
&\le \sum_{t=0}^{m-1} E_x^{1_t}\Big[1[t<T]\sum_{s=0}^{T\wedge m-1} 1[T\le m]\,c(T)\Big] \\
&\le \sum_{t=0}^{m-1} \big|E_x^{1_t}\big[1[t<T]\,T\,c(T)\big]\big| \\
&\le \sup_t E_x^{1_t}\big[T^2 c(T)\big] \le C_l(x), \qquad l = 2 + \delta
\end{aligned} \tag{6.41}$$

for all $x \in \mathbb{N}^K$. ■

Corollary 6.3 continues to hold as stated under the hypotheses of Theorem 6.2bis. Furthermore,
it is easy to see that the growth estimates of Corollaries 6.4-6.5 continue to hold under the
assumption that

C,(x)  <Ci-(l+\x  |c"*') ,  x  G  1NJ<  (6.42) 

for  some  positive  constants  Ci  and  cm. 

7.  CONVERGENCE  OF  THE  STOCHASTIC  APPROXIMATIONS 
7.1.  The  ODE  method 

This section is devoted to proving the convergence of the recursive scheme (4.5)-(4.6) when
the policy $\alpha$ is in use. Recall that the mapping $\eta \to J(f^{\eta})$ is monotone increasing.

The following additional assumption (R7) is imposed in order to carry out the analysis.

(R7): The equation

$$J(f^{\eta}) = V, \qquad 0 \le \eta \le 1 \tag{7.1}$$

has a unique solution $\eta^*$.

Note that the continuity of $\eta \to J(f^{\eta})$ now implies

$$[J(f^{\eta}) - V]\,(\eta - \eta^*) > 0, \qquad \eta \ne \eta^*,\ \eta \in [0,1]. \tag{7.2}$$

The uniqueness of the solution to (7.1) is tantamount to local strict monotonicity and, in practice,
is often verified by establishing some stronger monotonicity property on $\eta \to J(f^{\eta})$ such as (R7b)
below.

(R7b): The mapping $[0,1] \to \mathbb{R} : \eta \to J(f^{\eta})$ is strictly monotone increasing.

In Section 8, condition (R7b) is shown to hold for a steering problem which arises from a
constrained optimization problem.

The proof of Theorem 4.1 given below uses a version of the ODE method which was proposed
by Metivier and Priouret in [21]. The arguments combine the deterministic lemma of Kushner
and Clark [16] with a probabilistic result based on properties of the Poisson equation (6.2). This
key result is given in Theorem 7.1 below, the proof of which is delayed until the second part of the
section. To state the result, consider the rvs $\{Y(n),\ n = 0,1,\ldots\}$ given by

$$Y(n) := J(f^{\eta(n)}) - c(X(n+1)), \qquad n = 0,1,\ldots \tag{7.3}$$

and pose

$$m(n,t) := \max\Big\{k \ge n : \sum_{i=n}^{k-1} a_i \le t\Big\}, \qquad t > 0,\ n = 0,1,\ldots \tag{7.4}$$

Theorem 7.1 Assume (R1)-(R6) with $\rho < 1$ and $2\delta + 3 < \gamma$. For each $t > 0$, the convergence

$$\lim_n \Big(\sup_{n \le k \le m(n,t)} \Big|\sum_{i=n}^{k} a_i Y(i)\Big|\Big) = 0 \qquad P^{\alpha}\text{-a.s.} \tag{7.5}$$

takes place.
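
To make the time window appearing in (7.5) concrete, the Python sketch below computes $m(n,t)$ for a given gain sequence. It assumes the form that is standard in the ODE method, namely that $m(n,t)$ is the largest $k \ge n$ with $a_n + \cdots + a_{k-1} \le t$; this reading, the function names and the gains $a_i = 1/(i+1)$ are illustrative assumptions.

```python
def m(n: int, t: float, gain) -> int:
    """Largest k >= n with gain(n) + ... + gain(k-1) <= t (assumed form of (7.4))."""
    k, total = n, 0.0
    # Extend the window while adding the next gain keeps the sum within t.
    while total + gain(k) <= t:
        total += gain(k)
        k += 1
    return k


def harmonic_gain(i: int) -> float:
    """Illustrative gains a_i = 1/(i+1): square-summable but not summable."""
    return 1.0 / (i + 1)


window_end = m(0, 1.5, harmonic_gain)  # 1 + 1/2 <= 1.5 < 1 + 1/2 + 1/3
```

Since the gains decrease to zero, the number of indices between $n$ and $m(n,t)$ grows with $n$ while the accumulated "ODE time" stays at most $t$; the loop terminates because the illustrative gains are not summable.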

Proof of Theorem 4.1. As shown in [16, 21], the convergence (7.5) underlies the $P^{\alpha}$-a.s.
convergence of the estimates $\{\eta(n),\ n = 0,1,\ldots\}$ to $\eta^*$. The reader is invited to consult these
references for a complete exposition of the arguments, which are now briefly summarized. Interpolate
the estimate sequence $\{\eta(n),\ n = 0,1,\ldots\}$ by a piecewise linear function $\eta^{(0)} : [0,\infty) \to \mathbb{R}$ such
that $\eta^{(0)}(t_n) = \eta(n)$ at time $t_n = \sum_{i=0}^{n-1} a_i$ for all $n = 0,1,\ldots$ (with $t_0 = 0$). Moreover, define a
sequence of left shifts $\{\eta^{(n)}(\cdot),\ n = 0,1,\ldots\}$, i.e., $\eta^{(n)}(t) = \eta^{(0)}(t + t_n)$ for all $t \ge 0$, in order to bring
the "asymptotic part" of $\{\eta(n),\ n = 0,1,\ldots\}$ back to a neighborhood of the time origin.

Now observe that the recursion (4.6) can be written in the form

$$\eta(n+1) = \Big[\eta(n) + a_{n+1}\big[(V - J(f^{\eta(n)})) + Y(n)\big]\Big]_0^1, \qquad n = 0,1,\ldots \tag{7.6}$$

and that from any subsequence $\{\eta^{(m)}(\cdot),\ m = 0,1,\ldots\}$ a further convergent subsequence
$\{\eta^{(m_p)}(\cdot),\ p = 0,1,\ldots\}$ can then be extracted by standard boundedness and equicontinuity
arguments. It is then easy to see from Theorem 7.1 that the limit $\eta^*(\cdot)$ along this subsequence,
and for that matter the limit of any convergent subsequence, satisfies the ODE

$$\dot\eta^*(t) = V - J(f^{\eta^*(t)}), \qquad t \ge 0,\ \eta^*(0) \in [0,1]. \tag{7.7}$$

Owing to (7.2), this ODE is asymptotically stable with a unique stable point $\eta^*$ in $[0,1]$. A simple
shifting argument now implies $\eta^*(t) = \eta^*$ for all $t \ge 0$ and this completes the proof. These
arguments are standard and are therefore omitted here in the interest of brevity. ■
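
The mechanics of a projected recursion of this type can be seen on a toy problem. The Python sketch below is not the queueing model of the paper: it assumes a hypothetical strictly increasing cost curve $J(\eta) = 2\eta$, a target $V = 1$ (so that the root is $\eta^* = 0.5$), i.i.d. bounded noise in place of the Markovian fluctuations, and gains $a_n = 1/n$. All names are illustrative.

```python
import random


def projected_sa(V, noisy_cost, steps, eta0=0.0, seed=0):
    """Iterate eta <- clip(eta + a_{n+1} * (V - observed cost), 0, 1), a_n = 1/n."""
    rng = random.Random(seed)
    eta = eta0
    for n in range(steps):
        gain = 1.0 / (n + 1)             # a_{n+1}
        observed = noisy_cost(eta, rng)  # plays the role of c(X(n+1))
        eta = min(1.0, max(0.0, eta + gain * (V - observed)))
    return eta


def toy_cost(eta, rng):
    """Hypothetical observation: J(eta) = 2 * eta plus uniform noise on [-1, 1]."""
    return 2.0 * eta + rng.uniform(-1.0, 1.0)


eta_final = projected_sa(V=1.0, noisy_cost=toy_cost, steps=20000)
```

When the observed cost exceeds $V$ the estimate is pushed down, and otherwise up; the truncation to $[0,1]$ keeps the iterate a valid randomization bias. In the paper the noise is state dependent, which is why the Poisson equation is needed to control it.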

The  remainder  of  this  section  is  devoted  to  a  proof  of  (7.5). 

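7.2. A proof of Theorem 7.1

The Poisson equation (6.2) implies the relations

$$E^{\eta}\big[h(\eta, X(n+1)) \mid \mathcal{F}_n\big] = h(\eta, X(n)) + J(\eta) - c(X(n)), \qquad n = 0,1,\ldots \tag{7.8}$$

for all $0 \le \eta \le 1$. It then follows from (6.8b) and (7.3) that

$$\begin{aligned}
-Y(n) &= c(X(n+1)) - J(\eta(n)) \\
&= h(\eta(n), X(n+1)) - E^{\eta(n)}\big[h(\eta(n), X(n+2)) \mid \mathcal{F}_{n+1}\big] \\
&= Z_n^{(1)} + Z_n^{(2)} + Z_n^{(3)}, \qquad n = 0,1,\ldots
\end{aligned} \tag{7.9}$$

with

$$Z_n^{(1)} := h(\eta(n), X(n+1)) - E^{\eta(n)}\big[h(\eta(n), X(n+1)) \mid \mathcal{F}_n\big], \tag{7.10a}$$

$$Z_n^{(2)} := E^{\eta(n)}\big[h(\eta(n), X(n+1)) \mid \mathcal{F}_n\big] - E^{\eta(n+1)}\big[h(\eta(n+1), X(n+2)) \mid \mathcal{F}_{n+1}\big] \tag{7.10b}$$

and

$$Z_n^{(3)} := E^{\eta(n+1)}\big[h(\eta(n+1), X(n+2)) \mid \mathcal{F}_{n+1}\big] - E^{\eta(n)}\big[h(\eta(n), X(n+2)) \mid \mathcal{F}_{n+1}\big] \tag{7.10c}$$

for all $n = 0,1,\ldots$. It now suffices to show that

$$\lim_n \Big(\sup_{n \le \ell \le m(n,t)} \Big|\sum_{i=n}^{\ell} a_i Z_i^{(k)}\Big|\Big) = 0 \qquad P^{\alpha}\text{-a.s.} \tag{7.11}$$

for all $t > 0$ and all $k = 1,2,3$.

This will be done in three steps. To facilitate the presentation, define the rvs $\{S_n^{(k)},\ n = 0,1,\ldots\}$ by

$$S_n^{(k)} := \sum_{i=0}^{n-1} a_i Z_i^{(k)}, \qquad n = 1,2,\ldots \tag{7.12}$$

for $k = 1,2,3$, with $S_0^{(k)} = 0$.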

Step 1: The rvs $\{Z_n^{(1)},\ n = 0,1,\ldots\}$ form a $(P^{\alpha}, \mathcal{F}_n)$ martingale-difference sequence, whence
$\{S_n^{(1)},\ n = 0,1,\ldots\}$ is a zero-mean $(P^{\alpha}, \mathcal{F}_n)$-martingale. Routine calculations show that

$$\sup_n E^{\alpha}\big[|S_n^{(1)}|^2\big] = \sup_n E^{\alpha}\Big[\sum_{i=0}^{n-1} a_i^2\,|Z_i^{(1)}|^2\Big] \tag{7.13}$$

$$\le 4 \sup_n E^{\alpha}\big[|h(\eta(n), X(n+1))|^2\big] \cdot \sum_{i=0}^{\infty} a_i^2 \tag{7.14}$$

$$\le 4H_2 \cdot \sum_{i=0}^{\infty} a_i^2. \tag{7.15}$$

The passage from (7.14) to (7.15) uses the estimate (6.37) given in Corollary 6.6 (with $r = 2$
since $2\delta + 2 < \gamma - 1$). It is plain from (4.7) that the left-hand side of (7.13) is finite, and the
$(P^{\alpha}, \mathcal{F}_n)$-martingale $\{S_n^{(1)},\ n = 0,1,\ldots\}$ is thus uniformly integrable under $P^{\alpha}$. By the Martingale
Convergence Theorem, the rvs $\{S_n^{(1)},\ n = 0,1,\ldots\}$ converge a.s. under $P^{\alpha}$ (to an a.s. finite limit),
in which case they form a Cauchy sequence $P^{\alpha}$-a.s. and (7.11) follows for $k = 1$.

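Step 2: For $k = 2$, note first the relations

$$\begin{aligned}
\sum_{i=n}^{\ell} a_i Z_i^{(2)} = {} & -\sum_{i=n}^{\ell} (a_{i-1} - a_i)\,E^{\eta(i)}\big[h(\eta(i), X(i+1)) \mid \mathcal{F}_i\big] \\
& + a_{n-1}\,E^{\eta(n)}\big[h(\eta(n), X(n+1)) \mid \mathcal{F}_n\big] - a_{\ell}\,E^{\eta(\ell+1)}\big[h(\eta(\ell+1), X(\ell+2)) \mid \mathcal{F}_{\ell+1}\big]
\end{aligned} \tag{7.16}$$

valid for all $0 < n \le \ell$. Define the rvs $\{K_n,\ n = 0,1,\ldots\}$ by

$$K_n := E^{\eta(n)}\big[h(\eta(n), X(n+1)) \mid \mathcal{F}_n\big], \qquad n = 0,1,\ldots \tag{7.17}$$

and set

$$B_r = \sup_n E^{\alpha}\big[|K_n|^r\big], \qquad r = 1,2,\ldots \tag{7.18}$$

It is clear from (6.37) (with $r = 1,2$) and Jensen's inequality that $B_1 \le H_1 < \infty$ and $B_2 \le H_2 < \infty$.
With this notation, (7.16) can be rewritten as

$$\big|S_{\ell+1}^{(2)} - S_n^{(2)}\big| \le a_{n-1}|K_n| + \sum_{i=n}^{\ell} (a_{i-1} - a_i)\,|K_i| + a_{\ell}|K_{\ell+1}| \tag{7.19}$$

for all $0 < n \le \ell$ since $a_n \downarrow 0$. If the rvs $\{S_n,\ n = 1,2,\ldots\}$ and $\{R_n,\ n = 0,1,\ldots\}$ are now defined
by

$$S_n = \sum_{i=1}^{n} (a_{i-1} - a_i)\,|K_i|, \qquad n = 1,2,\ldots \tag{7.20}$$

and

$$R_n = \sum_{i=0}^{n} |a_i|^2\,|K_{i+1}|^2, \qquad n = 0,1,\ldots \tag{7.21}$$

then, with the convention $S_0 = 0$, (7.19) becomes

$$\big|S_{\ell+1}^{(2)} - S_n^{(2)}\big| \le a_{n-1}|K_n| + |S_{\ell} - S_{n-1}| + a_{\ell}|K_{\ell+1}|, \qquad 0 < n \le \ell. \tag{7.22}$$

The definition (7.20) implies

$$E^{\alpha}[S_n] \le B_1 \sum_{i=1}^{n} (a_{i-1} - a_i) = B_1(a_0 - a_n) \le B_1 a_0, \qquad n = 1,2,\ldots \tag{7.23}$$

Since $S_n \le S_{n+1}$, the limit $S_{\infty} := \lim_n S_n$ exists and therefore $E^{\alpha}[S_{\infty}] \le B_1 a_0$ by using the
Monotone Convergence Theorem on (7.23). Consequently, $S_{\infty} < \infty$ $P^{\alpha}$-a.s. and the rvs $\{S_n,\ n = 1,2,\ldots\}$
form a Cauchy sequence $P^{\alpha}$-a.s., i.e.,

$$\lim_n \sup_{\ell \ge n} |S_{\ell} - S_{n-1}| = 0 \qquad P^{\alpha}\text{-a.s.} \tag{7.24}$$

To handle the first and last terms of (7.22), observe that $R_n \le R_{n+1}$, hence the limit $R_{\infty} := \lim_n R_n$
exists and satisfies

$$E^{\alpha}[R_{\infty}] \le B_2 \sum_{i=0}^{\infty} a_i^2 < \infty \tag{7.25}$$

by virtue of the Monotone Convergence Theorem. Consequently, $\lim_n R_n = R_{\infty} < \infty$ $P^{\alpha}$-a.s.,
whence $\lim_n a_{n-1}|K_n| = 0$ $P^{\alpha}$-a.s. or equivalently

$$\lim_n \sup_{\ell \ge n} a_{\ell-1}|K_{\ell}| = 0 \qquad P^{\alpha}\text{-a.s.} \tag{7.26}$$

by the Cauchy convergence criterion. Making use of (7.24) and (7.26) readily leads (via (7.22)) to
the conclusion (7.11) for $k = 2$.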
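
The rearrangement used at the start of Step 2 is a summation by parts applied to differences of the form $K_i - K_{i+1}$. The Python sketch below checks the underlying algebraic identity numerically; the sequences `a` and `K` are arbitrary placeholders and are not computed from the queueing model.

```python
import random


def lhs(a, K, n, l):
    """Sum over i = n..l of a_i * (K_i - K_{i+1})."""
    return sum(a[i] * (K[i] - K[i + 1]) for i in range(n, l + 1))


def rhs(a, K, n, l):
    """-Sum over i = n..l of (a_{i-1} - a_i) K_i, plus a_{n-1} K_n, minus a_l K_{l+1}."""
    body = -sum((a[i - 1] - a[i]) * K[i] for i in range(n, l + 1))
    return body + a[n - 1] * K[n] - a[l] * K[l + 1]


rng = random.Random(1)
a = [1.0 / (i + 1) for i in range(50)]           # decreasing gains
K = [rng.uniform(-5.0, 5.0) for _ in range(51)]  # arbitrary values
gap = max(
    abs(lhs(a, K, n, l) - rhs(a, K, n, l))
    for n in range(1, 49)
    for l in range(n, 49)
)
```

The point of the rearrangement is that the increments $a_{i-1} - a_i$ are non-negative and telescope, so the sum is controlled by first moments of $|K_i|$ alone.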

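Step 3: For $k = 3$, observe that (7.8) and the estimates of Corollary 6.3 readily yield the estimates

$$\begin{aligned}
\big|E^{\eta}\big[h(\eta, X(n+1)) \mid \mathcal{F}_n\big] - E^{\eta'}\big[h(\eta', X(n+1)) \mid \mathcal{F}_n\big]\big|
&= \big|h(\eta, X(n)) - h(\eta', X(n)) + J(\eta) - J(\eta')\big| \\
&\le 4\tilde K(X(n)) \cdot |\eta - \eta'|, \qquad n = 0,1,\ldots
\end{aligned} \tag{7.27}$$

for all $\eta$ and $\eta'$ in $[0,1]$, where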

K(x):=K(x)  +  2^T1(x),  x  G  IN  .  (7.28) 

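The recursion (4.6) implies

$$|\eta(n+1) - \eta(n)| \le a_{n+1}\,|V - c(X(n+1))|, \qquad n = 0,1,\ldots \tag{7.29}$$

and the inequality

$$|Z_n^{(3)}| \le 4a_{n+1}\,Q(X(n+1)), \qquad n = 0,1,\ldots \tag{7.30}$$

is now obtained from (7.27), upon setting

$$Q(x) := \tilde K(x)\,(V + |c(x)|), \qquad x \in \mathbb{N}^K. \tag{7.31}$$

Under (R5), with the help of (6.5) and (6.32a), it is a simple exercise to check that

$$Q(x) \le C\,(1 + |x|^{2\delta+2}), \qquad x \in \mathbb{N}^K \tag{7.32}$$

for some positive constant $C$. Consequently,

$$E^{\alpha}\Big[\sum_{i=0}^{n} a_i\,|Z_i^{(3)}|\Big] \le 4C \sum_{i=0}^{n} a_i a_{i+1}\,E^{\alpha}\big[1 + |X(i+1)|^{2\delta+2}\big] \tag{7.33}$$

$$\le 4C\,(1 + K_{\gamma}) \cdot \sum_{i=0}^{\infty} a_i^2, \qquad n = 0,1,\ldots \tag{7.34}$$

where the passage from (7.33) to (7.34) is a simple consequence of (5.1) (since $2\delta + 2 < \gamma - 1$).
Now, in exactly the same way as in Step 2 of the proof, this uniform bound (7.34) implies

$$\lim_n \sup_{\ell \ge n} \Big|\sum_{i=n}^{\ell} a_i Z_i^{(3)}\Big| = 0 \qquad P^{\alpha}\text{-a.s.} \tag{7.35}$$

and (7.11) obviously holds for $k = 3$. ■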

7.3.  The  general  model 

The results of this section rely on boundedness and smoothness properties of solutions to the
Poisson equation, but the structure of the proof is otherwise quite general. In fact, consider a set
of stationary policies, parameterized by $\eta \in [0,1]$, and let $(J(\eta), h(\eta,x))$ denote the solution to the
Poisson equation under policy $\eta$. Such a parameterization may arise as in Section 6.4, but for the
purposes of the present section this is immaterial. Suppose that the properties (i)-(iii) below can
be established, as was done for the queueing model under consideration, where

(i): For each $\eta$ in $[0,1]$, $J(\eta)$ equals the cost under $\eta$,

(ii): For each $\eta$ in $[0,1]$, $x \to h(\eta,x)$ is at most polynomial in $x$,

(iii): For each $x$ in $\mathbb{N}^K$, $\eta \to h(\eta,x)$ and $\eta \to J(\eta)$ are Lipschitz in $\eta$, where the Lipschitz
constant of $h(\eta,x)$ is at most polynomial in $x$.

Then the conclusions of Theorem 7.1 (and hence the conclusions of Theorem 4.1) hold with proofs
almost unchanged, provided appropriate bounds on moments of the state process are available.
Condition (ii) is obtained in Corollary 6.6 and its generalizations, and validates Steps 1-2 in
Section 7.2 above. Condition (iii) allows the argument in Step 3 to be carried through, and the
only changes required involve the constants and the exponents $\delta$, $\gamma$ and $r$.

8.  CONVERGENCE  OF  THE  ADAPTIVE  POLICY  AND  APPLICATIONS 

This  final  section  contains  a  proof  of  Theorem  4.2,  as  well  as  the  discussion  of  an  application  that 
arises  in  constrained  optimization. 

8.1.  A  proof  of  Theorem  4.2. 

The proof follows from general results obtained by the authors on the Certainty Equivalence
Principle when specialized to "simply randomized" policies [26]. First note that the (assumed)
convergence $\lim_n \eta(n) = \eta^*$ in probability under $P^{\alpha}$, when combined with Theorem 7.2 of [26], implies
the key convergence condition (C) [Ibid., Section 4]. Consequently, the convergence (4.11)-(4.12)
follows from Theorem 3.1bis in [Ibid.] provided the hypotheses of Theorems 4.2 and 6.1bis of [Ibid.]
are satisfied. These hypotheses consist of the tightness of the rvs $\{X(t),\ t = 0,1,\ldots\}$ under $P^{\alpha}$
and of bounds on the moments of the rvs $\{c(X(t)),\ h(\eta^*, X(t)),\ t = 0,1,\ldots\}$ under various policies.
It is easy to check that these conditions are all implied by the following condition:

There exist $\varepsilon > 0$ and a positive constant $C_{\varepsilon}$ such that for every non-idling policy $\pi$ in $\mathcal{P}$, the
bounds

$$\sup_t E^{\pi}\big[|X(t)|^{1+\varepsilon}\big] \le C_{\varepsilon}, \tag{8.1}$$

$$\sup_t E^{\pi}\big[|c(X(t))|^{1+\varepsilon}\big] \le C_{\varepsilon}, \tag{8.2}$$

$$\sup_t E^{\pi}\big[|h(\eta^*, X(t))|^{1+\varepsilon}\big] \le C_{\varepsilon} \tag{8.3}$$

hold.

Observe that by virtue of Theorem 5.1, the bound (8.1) readily holds whenever $1 + \varepsilon < \gamma - 1$.
By assumption, $c$ is of polynomial growth with rate $\delta$, so that (8.2) holds if $\delta(1 + \varepsilon) < \gamma - 1$ by the
remark made earlier. To obtain the third bound (8.3), observe from (6.34) that for every $\varepsilon > 0$,

$$|h(\eta^*, X(n))|^{1+\varepsilon} \le |2B_h|^{1+\varepsilon}\,\big(1 + |X(n)|^{(\delta+1)(\varepsilon+1)}\big), \qquad n = 0,1,\ldots \tag{8.4}$$

and (8.3) follows with $(1 + \varepsilon)(1 + \delta) < \gamma - 1$ by again making use of Theorem 5.1. Consequently
(8.1)-(8.3) will hold provided $\varepsilon$ is chosen positive such that $1 + (1 + \delta)(1 + \varepsilon) < \gamma$.

An identical analysis applies for the long-run average cost associated with $d$; details are left to
the interested reader. ■
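
The moment requirements above reduce to simple arithmetic in $\delta$ and $\gamma$. As a rough aid, the hypothetical helper below (not part of the paper) returns the supremum of the admissible values of $\varepsilon$ under the sufficient condition $1 + (1+\delta)(1+\varepsilon) < \gamma$, or zero when no positive value qualifies.

```python
def epsilon_supremum(delta: float, gamma: float) -> float:
    """Supremum of eps > 0 with 1 + (1 + delta) * (1 + eps) < gamma; 0.0 if none."""
    bound = (gamma - 1.0) / (1.0 + delta) - 1.0
    return max(0.0, bound)


# With a linear cost (delta = 1) and gamma = 6, any 0 < eps < 1.5 qualifies.
eps_sup = epsilon_supremum(1.0, 6.0)
```

A positive $\varepsilon$ exists exactly when $\gamma > 2 + \delta$ under this sufficient condition.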

8.2  An  application  to  constrained  optimization 

Consider the following situation discussed by Nain and Ross in [22]. Several types of traffic,
say voice, video and data, compete for the use of a single resource (or server). The performance
requirements for this system are defined by the minimization of a weighted average of the number of
video and data packets subject to the constraint that the average number of voice packets waiting
for service does not exceed $V$. This situation can be modelled by a system of $K$ competing queues
with $P = 0$. For a precise definition of the performance measures, set

$$c(x) := x_K \quad \text{and} \quad d(x) := \sum_{k=1}^{K-1} d_k x_k, \qquad x \in \mathbb{N}^K \tag{8.5}$$

where $d_1, \ldots, d_{K-1}$ are positive constants. Denote by $J_c(\pi)$ (resp. $J_d(\pi)$) the long-run average
cost (4.1) associated with the cost $c$ (resp. $d$) when using the policy $\pi$ in $\mathcal{P}$. The constrained
optimization problem $(P_V)$ is then formulated as

$$(P_V): \quad \text{Minimize } J_d(\cdot) \text{ over } \mathcal{P}_V \tag{8.6}$$

where $\mathcal{P}_V := \{\pi \in \mathcal{P} : J_c(\pi) \le V\}$.

Assume the problem to be feasible and non-trivial, i.e., $\mathcal{P}_V$ is non-empty and the policies which
minimize $J_d$ are not in $\mathcal{P}_V$. In that case, Nain and Ross [22] showed that there exist two strict
priority policies $\bar g$ and $\underline g$ and a bias $\eta^*$ satisfying the equation

$$J_c(f^{\eta}) = V, \qquad \eta \text{ in } [0,1] \tag{8.7}$$

such that $f^{\eta^*}$ defined through (4.3) is optimal. While the policies $\bar g$ and $\underline g$ can be found explicitly,
the determination of $\eta^*$ is a difficult task since for $0 < \eta < 1$ the evaluation of $J_c(f^{\eta})$ requires
solving a Riemann-Hilbert problem. That this computational difficulty can be circumvented by
making use of a stochastic approximation-based policy is the content of the following.

Theorem 8.1 Assume (R1)-(R5) with $\rho < 1$ and $\gamma > 5$. The scheme (4.5)-(4.6) solves the
constrained optimization problem $(P_V)$ provided it is feasible.

Proof: Nain and Ross [22, Thm. 3.1, pp. 885-886] showed that if the problem is feasible and
non-trivial, then there exist Markov stationary policies $\bar g$ and $\underline g$ such that (8.7) has at least one solution.
In fact, both policies are fixed priority policies with $\underline g$ giving highest priority to queue $K$, and $\bar g$
giving lowest priority to queue $K$, while the relative priorities of the other queues are otherwise
identical. Moreover, the mapping $\eta \to J_c(f^{\eta})$ is monotone non-decreasing. It is shown in Lemma
8.2 below that this mapping is in fact strictly monotone increasing. When $\gamma > 5$, the conditions
of Theorems 4.1 and 4.2 are readily verified with $\delta = 1$. Hence, $\lim_n \eta(n) = \eta^*$ $P^{\alpha}$-a.s. so that
$J_c(\alpha) = J_c(f^{\eta^*}) = V$ and $J_d(\alpha) = J_d(f^{\eta^*})$, i.e., $\alpha$ is a policy in $\mathcal{P}_V$ and is thus also constrained
optimal.

If the problem is trivial, i.e., $J_c(\bar g) \le V$, then $\bar g$ solves $(P_V)$. In that case, the same arguments
imply that $\lim_n \eta(n) = 1$ $P^{\alpha}$-a.s., and optimality follows. ■


In the case $K = 2$, the two policies $\bar g$ and $\underline g$ are necessarily the fixed priority rules for queue
1 and 2, respectively. The adaptive policy then does not assume any prior information on
the statistics of the system, provided (R1)-(R5) hold with $\gamma > 5$. For this case, optimality was
obtained by Shwartz and Makowski [25] under a slightly weaker assumption (namely $\gamma > 3$), but
the convergence (4.10) was only in probability.


This  section  concludes  with  the  following  monotonicity  result  which  was  needed  in  the  proof 
of  Theorem  8.1. 

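Lemma 8.2. Under (R1)-(R5) the mapping $\eta \to J_c(f^{\eta})$ is strictly monotone increasing on $[0,1]$.

Proof: It is plain from (6.8) that proving the strict monotonicity of $\eta \to J_c(f^{\eta})$ is equivalent to
proving the same for $\eta \to C(\eta, 0)$. Fix $\eta$ in $[0,1]$ and recall the definition (4.3) of the policy $f^{\eta}$.
The representation (6.22) of the derivative of $C_m(\eta, 0)$ can be written in the form

$$\begin{aligned}
\dot C_m(\eta, 0) &= \sum_{t=0}^{m-1} \left( E_0^{1_t}\Big[\sum_{\ell=1}^{\infty} 1[T=\ell] \sum_{s=0}^{\ell-1} 1[T \le m]\,X_K(s)\Big] - E_0^{0_t}\Big[\sum_{\ell=1}^{\infty} 1[T=\ell] \sum_{s=0}^{\ell-1} 1[T \le m]\,X_K(s)\Big] \right) \\
&= \sum_{t=0}^{m-1} \sum_{\ell=t+1}^{m} \left( E_0^{1_t}\Big[1[T=\ell] \sum_{s=0}^{\ell-1} X_K(s)\Big] - E_0^{0_t}\Big[1[T=\ell] \sum_{s=0}^{\ell-1} X_K(s)\Big] \right)
\end{aligned} \tag{8.8}$$

where (6.23) was used. If it were possible to show bounds of the form

$$\Delta(\ell, t, s) := E_0^{1_t}\big[1[T=\ell]\,X_K(s)\big] - E_0^{0_t}\big[1[T=\ell]\,X_K(s)\big] \ge \varepsilon(\ell, t, s) \tag{8.9}$$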

with $\varepsilon(\ell,t,s) \ge 0$ for all $0 \le s < \ell$ and $0 \le t < \ell$, and $\varepsilon(\ell,t,s) > 0$ for at least one such triple
$(\ell,t,s)$, then necessarily for some $m$, $0 < \dot C_m(\eta, 0) \le \dot C_{m+1}(\eta, 0) \le \ldots$ and the strict monotonicity
would follow from the second equality in (6.28).

Fix $t$ and $\ell$ such that $0 \le t < \ell$. It is easy to see that $\Delta(\ell,t,s) = 0$ whenever $0 \le s \le t < \ell$,
so that only the case $0 \le t < s$ has to be considered in order to prove (8.9). This is done by the
following coupling arguments.

Let $P$ be a probability measure on $(\Omega, \mathcal{F})$ under which (P1)-(P4) hold and $X(0) = 0$. Moreover,
let $\{\beta(n),\ n = 0,1,\ldots\}$ be a sequence of i.i.d. Bernoulli rvs with parameter $\eta$ which is also
independent of the rvs $\{A(n), B(n),\ n = 0,1,\ldots\}$ under $P$.

The key point of the proof is to construct on $\Omega$ a pair of processes $\{X^0(n),\ n = 0,1,\ldots\}$ and
$\{X^1(n),\ n = 0,1,\ldots\}$ such that (i) $\{X^0(n),\ n = 0,1,\ldots\}$ (resp. $\{X^1(n),\ n = 0,1,\ldots\}$) under $P$
is statistically indistinguishable from $\{X(n),\ n = 0,1,\ldots\}$ under $P_0^{0_t}$ (resp. $P_0^{1_t}$), and (ii) a simple
comparison leads to (8.9). To that end, for each $i = 0,1$, define the process $\{X^i(n),\ n = 0,1,\ldots\}$
by the recursion

$$X_k^i(n+1) = X_k^i(n) + A_k(n) - 1[X^i(n) \ne 0]\,U_k^i(n)\,B_k^i(n), \qquad 1 \le k \le K,\ n = 0,1,\ldots \tag{8.10}$$

with $X^i(0) = 0$, where the sequences $\{U^i(n),\ n = 0,1,\ldots\}$ and $\{B^i(n),\ n = 0,1,\ldots\}$ still need to
be specified.

For $i = 0,1$, the control actions $\{U^i(n),\ n = 0,1,\ldots\}$ are defined by

$$U^i(n) = \beta(n)\,\bar g(X^i(n)) + (1 - \beta(n))\,\underline g(X^i(n)), \qquad n \ne t \tag{8.11a}$$

$$U^0(t) = \underline g(X^0(t)) \tag{8.11b}$$

$$U^1(t) = \bar g(X^1(t)) \tag{8.11c}$$

so that the rvs $\{X^0(n),\ n = 0,1,\ldots\}$ (resp. $\{X^1(n),\ n = 0,1,\ldots\}$) are governed by the policy $0_t$
(resp. $1_t$).

Only the service sequences $\{B^i(n),\ n = 0,1,\ldots\}$, $i = 0,1$, need to be specified. First,
set $B^0(n) = B(n)$ for all $n = 0,1,\ldots$ and observe from the construction (8.10)-(8.11) that
the distribution of $\{X^0(n),\ n = 0,1,\ldots\}$ under $P$ obviously coincides with the distribution of
$\{X(n),\ n = 0,1,\ldots\}$ under $P_0^{0_t}$. The construction of the process $\{B^1(n),\ n = 0,1,\ldots\}$ is somewhat
more involved, and is done below. In order to facilitate the coupling argument, the actual
service duration of each customer will be defined in such a way so as to have identical length (for
each $\omega$ in $\Omega$) in both processes. To do this, the rvs $B^1(n)$ are defined in (8.12)-(8.14) so that the
number of unsuccessful services experienced by each customer is identical in both systems. Set

$$B^1(n) := B(n), \qquad n = 0,1,\ldots,t-1 \tag{8.12}$$

and observe from (8.10) that in order to determine the process $\{X^1(n),\ n = 0,1,\ldots\}$, it suffices
to provide the values of $B_k^1(n)$ at times $n$ such that $U^1(n) = e_k$, $1 \le k \le K$. For all $i = 0,1$ and
$1 \le k \le K$, set

$$\tau_k^i(1) := \min\{n \ge t : U^i(n) = e_k\} \tag{8.13a}$$

$$\tau_k^i(\ell) := \min\{n > \tau_k^i(\ell-1) : U^i(n) = e_k\}, \qquad \ell = 2,3,\ldots \tag{8.13b}$$

and define

$$B_k^1(\tau_k^1(\ell)) := B_k(\tau_k^0(\ell)), \qquad 1 \le k \le K,\ \ell = 1,2,\ldots \tag{8.14}$$

With these definitions, the actual number of times each customer is served is identical in both
systems, while the sequences $\{B(n),\ n = 0,1,\ldots\}$ (under $P_0^{1_t}$) and $\{B^1(n),\ n = 0,1,\ldots\}$ (under
$P$) are statistically indistinguishable. Consequently, the distribution of $\{X^1(n),\ n = 0,1,\ldots\}$ under
$P$ coincides with the distribution of $\{X(n),\ n = 0,1,\ldots\}$ under $P_0^{1_t}$. Moreover, by construction
(with the notation of (6.3)), it is easy to see that $T^0 = T^1$ and $X_K^0(n) \le X_K^1(n)$ for all $n = 0,1,\ldots$
$P$-a.s., whence

$$\Delta(\ell,t,s) = E\big[1[T^1 = \ell]\,\big(X_K^1(s) - X_K^0(s)\big)\big] \ge 0. \tag{8.15}$$

Finally, for $s = t+1$, observe that on the event

$$A := [T^0 = \ell] \cap [X_K^0(t) \ne 0] \cap [X_k^0(t) \ne 0 \text{ for some } k = 1,2,\ldots,K-1] \cap [B_K(t) = 1], \tag{8.16}$$

the equality $X_K^1(t+1) - X_K^0(t+1) = 1$ holds, and that $P[A] > 0$. Consequently,

$$E\big[1[T^1 = \ell]\,\big[X_K^1(s) - X_K^0(s)\big]\big] =: \varepsilon(\ell,t,t+1) \ge P[A] > 0 \tag{8.17}$$

and the result is established.